I have a big text file of 13 GB with 158,609,739 lines and I want to randomly select 155,000,000 lines.
I have tried to scramble the file and then cut the first 155,000,000 lines, but it seems that my RAM (16 GB) isn't big enough for this. The pipelines I have tried are:
shuf file | head -n 155000000
sort -R file | head -n 155000000
Now, instead of selecting lines, I think it is more memory efficient to delete 3,609,739 random lines from the file to get a final file of 155,000,000 lines.
As you copy each line of the file to the output, assess the probability that it should be deleted. The first line should have a 3,609,739/158,609,739 chance of being deleted: if you generate a random number between 0 and 1 and that number is less than that ratio, don't copy it to the output. If the first line was deleted, the odds for the second line become 3,609,738/158,609,738; if that second line is then not deleted, the odds for the third line are 3,609,738/158,609,737. Repeat until done.
Because the odds change with each line processed, this algorithm guarantees the exact line count. Once you've deleted 3,609,739 lines the odds go to zero; if at any time you would need to delete every remaining line in the file, the odds go to one.
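A minimal Python 3 sketch of this idea (the file names are placeholders and the counts are taken from the question; a fuller implementation appears in a later answer):
import random

lines_left = 158609739          # total lines in the file
to_delete = 3609739             # lines that still need to be removed

with open("input.txt") as src, open("output.txt", "w") as dst:
    for line in src:
        # delete this line with probability to_delete / lines_left
        if random.random() < to_delete / lines_left:
            to_delete -= 1      # one fewer line left to remove
        else:
            dst.write(line)
        lines_left -= 1         # one fewer line left to examine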
You could always pre-generate which line numbers (a list of 3,609,739 random numbers selected without replacement) you plan on deleting, then just iterate through the file and copy to another, skipping lines as necessary. As long as you have space for a new file this would work.
You could select the random numbers with random.sample
E.g.,
random.sample(xrange(158609739), 3609739)
Proof of Mark Ransom's Answer
Let's use numbers easier to think about (at least for me!):
10 items
delete 3 of them
First time through the loop we will assume that the first three items get deleted -- here's what the probabilities look like:
first item: 3 / 10 = 30%
second item: 2 / 9 = 22%
third item: 1 / 8 = 12%
fourth item: 0 / 7 = 0 %
fifth item: 0 / 6 = 0 %
sixth item: 0 / 5 = 0 %
seventh item: 0 / 4 = 0 %
eighth item: 0 / 3 = 0 %
ninth item: 0 / 2 = 0 %
tenth item: 0 / 1 = 0 %
As you can see, once it hits zero, it stays at zero. But what if nothing is getting deleted?
first item: 3 / 10 = 30%
second item: 3 / 9 = 33%
third item: 3 / 8 = 38%
fourth item: 3 / 7 = 43%
fifth item: 3 / 6 = 50%
sixth item: 3 / 5 = 60%
seventh item: 3 / 4 = 75%
eighth item: 3 / 3 = 100%
ninth item: 2 / 2 = 100%
tenth item: 1 / 1 = 100%
So even though the probability varies per line, overall you get the results you are looking for. I went a step further and coded a test in Python for one million iterations as a final proof to myself -- remove seven items from a list of 100:
# python 3.2
from __future__ import division
from stats import mean  # http://pypi.python.org/pypi/stats
import random

counts = dict()
for i in range(100):
    counts[i] = 0

removed_failed = 0
for _ in range(1000000):
    to_remove = 7
    from_list = list(range(100))
    removed = 0
    while from_list:
        current = from_list.pop()
        probability = to_remove / (len(from_list) + 1)
        if random.random() < probability:
            removed += 1
            to_remove -= 1
            counts[current] += 1
    if removed != 7:
        removed_failed += 1

print(counts[0], counts[1], counts[2], '...',
      counts[49], counts[50], counts[51], '...',
      counts[97], counts[98], counts[99])
print("remove failed: ", removed_failed)
print("min: ", min(counts.values()))
print("max: ", max(counts.values()))
print("mean: ", mean(counts.values()))
and here's the results from one of the several times I ran it (they were all similar):
70125 69667 70081 ... 70038 70085 70121 ... 70047 70040 70170
remove failed: 0
min: 69332
max: 70599
mean: 70000.0
A final note: Python's random.random() is [0.0, 1.0) (doesn't include 1.0 as a possibility).
I believe you're looking for "Algorithm S" from section 3.4.2 of Knuth (D. E. Knuth, The Art of Computer Programming. Volume 2: Seminumerical Algorithms, second edition. Addison-Wesley, 1981).
You can see several implementations at http://rosettacode.org/wiki/Knuth%27s_algorithm_S
The Perlmonks list has some Perl implementations of Algorithm S and Algorithm R that might also prove useful.
These algorithms rely on there being a meaningful interpretation of floating point numbers like 3609739/158609739, 3609738/158609738, etc., which might not have sufficient resolution with a standard float type unless that type is implemented with double precision or better.
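One way to sidestep that resolution concern (a sketch of an alternative, not taken from the answers above) is to avoid forming the ratio at all and compare integers directly; random.randrange(lines_left) falls below lines_to_delete with probability exactly lines_to_delete/lines_left:
import random

def should_delete(lines_to_delete, lines_left):
    # Exact integer comparison, no floating point ratio involved.
    return random.randrange(lines_left) < lines_to_delete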
Here's a possible solution using Python:
import random

skipping = set(random.sample(range(158609739), 3609739))  # a set makes the membership test O(1)

infile = open(input_path)            # input_path: path of the original file
outfile = open(output_path, 'w')     # output_path: path of the reduced file

for i, line in enumerate(infile):
    if i in skipping:
        continue
    outfile.write(line)

infile.close()
outfile.close()
Here's another using Mark's method:
import random

lines_in_file = 158609739
lines_left_in_file = lines_in_file
lines_to_delete = lines_in_file - 155000000

infile = open(input_path)            # input_path: path of the original file
outfile = open(output_path, 'w')     # output_path: path of the reduced file

try:
    for line in infile:
        current_probability = lines_to_delete / lines_left_in_file
        lines_left_in_file -= 1
        if random.random() < current_probability:
            lines_to_delete -= 1
            continue
        outfile.write(line)
except ZeroDivisionError:
    print("More than %d lines in the file" % lines_in_file)
finally:
    infile.close()
    outfile.close()
I wrote this code before seeing that Darren Yin had expressed its principle.
I've modified my code to adopt the name skipping (I didn't dare to choose kangaroo ...) and the keyword continue from Ethan Furman, whose code follows the same principle.
I defined default arguments for the function's parameters so that it can be used several times without having to re-assign them at each call.
import random
import os.path

def spurt(ff, skipping):
    for i, line in enumerate(ff):
        if i in skipping:
            print 'line %d excluded : %r' % (i, line)
            continue
        yield line

def randomly_reduce_file(filepath, nk=None,
                         d={0: 'st', 1: 'nd', 2: 'rd', 3: 'th'}, spurt=spurt,
                         sample=random.sample, splitext=os.path.splitext):
    # count of the lines of the original file
    with open(filepath) as f:
        nl = sum(1 for _ in f)

    # asking for the number of lines to keep, if not given as argument
    if nk is None:
        nk = int(raw_input(' The file has %d lines.'
                           ' How many of them do you '
                           'want to randomly keep ? : ' % nl))

    # transfer of the lines to keep,
    # from one file to another file with a different name
    if nk < nl:
        with open(filepath, 'rb') as f,\
             open('COPY'.join(splitext(filepath)), 'wb') as g:
            g.writelines(spurt(f, set(sample(xrange(0, nl), nl - nk))))
            # sample(xrange(0, nl), nl - nk) picks the numbers of the lines
            # to be excluded; a set makes the membership test in spurt() fast
    else:
        print (' %d is %s than the number of lines (%d) in the file\n'
               ' no operation has been performed'
               % (nk, 'the same' if nk == nl else 'greater', nl))
With the $RANDOM variable you can get a random number between 0 and 32,767.
With this, you could read in each line, and see if $RANDOM is less than 155,000,000 / 158,609,739 * 32,767 (which is 32,021), and if so, let the line through.
Of course, this wouldn't give you exactly 155,000,000 lines, but pretty close to it, depending on the quality of the random number generator.
EDIT: Here is some code to get you started:
#!/bin/bash
while IFS= read -r line; do
    if (( $RANDOM < 32021 ))
    then
        echo "$line"
    fi
done
Call it like so:
thatScript.sh <inFile.txt >outFile.txt
Problem Statement:
Problem
Apollo is playing a game involving polyominos. A polyomino is a shape made by joining together one or more squares edge to edge to form a single connected shape. The game involves combining N polyominos into a single rectangular shape without any holes. Each polyomino is labeled with a unique character from A to Z.
Apollo has finished the game and created a rectangular wall containing R rows and C columns. He took a picture and sent it to his friend Selene. Selene likes pictures of walls, but she likes them even more if they are stable walls. A wall is stable if it can be created by adding polyominos one at a time to the wall so that each polyomino is always supported. A polyomino is supported if each of its squares is either on the ground, or has another square below it.
Apollo would like to check if his wall is stable and if it is, prove that fact to Selene by telling her the order in which he added the polyominos.
Input
The first line of the input gives the number of test cases, T. T test cases follow. Each test case begins with a line containing the two integers R and C. Then, R lines follow, describing the wall from top to bottom. Each line contains a string of C uppercase characters from A to Z, describing that row of the wall.
Output
For each test case, output one line containing Case #x: y, where x is the test case number (starting from 1) and y is a string of N uppercase characters, describing the order in which he built them. If there is more than one such order, output any of them. If the wall is not stable, output -1 instead.
Limits
Time limit: 20 seconds per test set.
Memory limit: 1GB.
1 ≤ T ≤ 100.
1 ≤ R ≤ 30.
1 ≤ C ≤ 30.
No two polyominos will be labeled with the same letter.
The input is guaranteed to be valid according to the rules described in the statement.
Test set 1
1 ≤ N ≤ 5.
Test set 2
1 ≤ N ≤ 26.
Sample
Input
4
4 6
ZOAAMM
ZOAOMM
ZOOOOM
ZZZZOM
4 4
XXOO
XFFO
XFXO
XXXO
5 3
XXX
XPX
XXX
XJX
XXX
3 10
AAABBCCDDE
AABBCCDDEE
AABBCCDDEE
Output
Case #1: ZOAM
Case #2: -1
Case #3: -1
Case #4: EDCBA
In sample case #1, note that ZOMA is another possible answer.
In sample case #2 and sample case #3, the wall is not stable, so the answer is -1.
In sample case #4, the only possible answer is EDCBA.
My Code:
class Case:
    def __init__(self, arr):
        self.arr = arr

    def solve(self):
        n = len(self.arr)
        if n == 1:
            return ''.join(self.arr[0])
        m = len(self.arr[0])
        dep = {}
        used = set()  # to save letters already used
        res = []
        for i in range(n-1):
            for j in range(m):
                # each letter depends on the letter below it
                if self.arr[i][j] not in dep:
                    dep[self.arr[i][j]] = set()
                # only add dependency besides itself
                if self.arr[i+1][j] != self.arr[i][j]:
                    dep[self.arr[i][j]].add(self.arr[i+1][j])
        for j in range(m):
            if self.arr[n-1][j] not in dep:
                dep[self.arr[n-1][j]] = set()
        # always find and retire the letters with all dependencies met
        while len(dep) > 0:
            # count how many letters are used in this round, if none is used, return -1
            count = 0
            next_dep = {}
            for letter in dep:
                if len(dep[letter]) == 0:
                    used.add(letter)
                    count += 1
                    res.append(letter)
                else:
                    all_used = True
                    for neigh in dep[letter]:
                        if neigh not in used:
                            all_used = False
                            break
                    if all_used:
                        used.add(letter)
                        count += 1
                        res.append(letter)
                    else:
                        next_dep[letter] = dep[letter]
            dep = next_dep
            if count == 0:
                return -1
        if count == 0:
            return -1
        return ''.join(res)

t = int(input())
for i in range(1, t + 1):
    R, C = [int(j) for j in input().split()]
    arr = []
    for j in range(R):
        arr.append([c for c in input()])
    case = Case(arr)
    print("Case #{}: {}".format(i, case.solve()))
My code successfully passes all sample cases I can think of, but still keeps getting WA when submitted. Can anyone spot what is wrong with my solution? Thanks
The question is below and my attempt follows it; I can't complete it no matter how I try.
N baskets are lined up, numbered 1 . . . N from left to right. Basket number i contains Ki apples. John and Mary want to draw a line between two baskets, and then John would get all the baskets to the left of the line and Mary all the baskets to the right of the line. Help them draw the line to divide the apples as equally as possible!
Input. The first line of the file jagasis.txt contains N, the number of baskets (2 ≤ N ≤ 1 000 000). Each of the following N lines contains an integer Ki: the number of apples in basket number i (1 ≤ i ≤ N, 0 ≤ Ki ≤ 10 000).
Output. The only line of the file jagaval.txt should contain a single integer: the number of the basket to the right of which the line should be drawn, so that the absolute value of the difference between the number of apples John gets, and the number of apples Mary gets, would be as small as possible. If there are multiple possible answers, output any one of them.
Example.
jagasis.txt
7
4
2
10
2
9
3
7
jagaval.txt
4
When the line is drawn between the fourth and the fifth basket, John gets 4 + 2 + 10 + 2 = 18 apples and Mary gets 9 + 3 + 7 = 19 apples. The difference between these numbers is 1, which is the smallest possible.
Here is my code, but it's not working for some reason:
f = open("jagasis.txt")
inputs = []
for line in f.read().split():
    inputs.append(int(line))
n = []
location = []
for x in range(inputs[0]):
    n = inputs[1:]
    m = n[:]
    del n[:x]
    m = set(m) - set(n)
    jagamine = sum(m) / sum(n)
    location.append(jagamine)
p = min(location, key=lambda x: abs(x - 1))
uu = location.index(p)
print(location)
f = open("jagaval.txt", "w")
f.write(str(uu))
There's no reason to use set(). You should just sum the numbers before and after the dividing line.
You should put the number of apples in a separate list from the number of baskets, so you don't have to keep skipping the first element of the list. And you don't need to make a copy and then delete things; use slices of the original list.
with open("jagasis.txt") as f:
    count = int(f.readline().strip())
    baskets = [int(x.strip()) for x in f]

sums = []
for x in range(1, count):
    left = sum(baskets[:x])
    right = sum(baskets[x:])
    sums.append(abs(left - right))
result = sums.index(min(sums)) + 1

with open("jagaval.txt", "w") as f:
    f.write(str(result))
This is not the most efficient algorithm; it is O(n**2). A more efficient algorithm notices that whenever you move the dividing line one basket to the right, the element at the dividing line is added to the left sum and subtracted from the right sum, so both sums can be kept as running totals. If this is a coding competition, you should use that algorithm or you'll probably fail by exceeding the time limit.
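A hedged sketch of that running-sum idea, reusing the file names from the question (not tested against the judge):
with open("jagasis.txt") as f:
    count = int(f.readline())
    baskets = [int(line) for line in f]

left = 0
right = sum(baskets)
best_split, best_diff = None, None
for i in range(1, count):            # line drawn to the right of basket i
    left += baskets[i - 1]           # basket i moves from the right side...
    right -= baskets[i - 1]          # ...to the left side in O(1)
    diff = abs(left - right)
    if best_diff is None or diff < best_diff:
        best_split, best_diff = i, diff

with open("jagaval.txt", "w") as f:
    f.write(str(best_split))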
I would just like to add that it would be nice if you said what exactly isn't working for you next time, as many different things may not "be working". Cheers.
I'm trying to write Python code to see how many coin tosses, on average, are required to get a sequences of N heads in a row.
The thing that I'm puzzled by is that the answers produced by my code don't match the ones given online, e.g. here (and many other places): https://math.stackexchange.com/questions/364038/expected-number-of-coin-tosses-to-get-five-consecutive-heads
According to that, the expected number of tosses that I should need to get various numbers of heads in a row are: E(1) = 2, E(2) = 6, E(3) = 14, E(4) = 30, E(5) = 62. But I don't get those answers! For example, I get E(3) = 8, instead of 14. The code below runs to give that answer, but you can change n to test for other target numbers of heads in a row.
What is going wrong? Presumably there is some error in the logic of my code, but I confess that I can't figure out what it is.
You can see, run and make modified copies of my code here: https://trinket.io/python/17154b2cbd
Below is the code itself, outside of that runnable trinket.io page. Any help figuring out what's wrong with it would be greatly appreciated!
Many thanks,
Raj
P.S. The closest related question that I could find was this one: Monte-Carlo Simulation of expected tosses for two consecutive heads in python
However, as far as I can see, the code in that question does not actually test for two consecutive heads, but instead tests for a sequence that starts with a head and then at some later, possibly non-consecutive, time gets another head.
# Click here to run and/or modify this code:
# https://trinket.io/python/17154b2cbd

import random

# n is the target number of heads in a row
# Change the value of n, for different target heads-sequences
n = 3

possible_tosses = [ 'h', 't' ]
num_trials = 1000
target_seq = ['h' for i in range(0,n)]
toss_sequence = []
seq_lengths_rec = []

for trial_num in range(0,num_trials):
    if (trial_num % 100) == 0:
        print 'Trial num', trial_num, 'out of', num_trials
        # (The free version of trinket.io uses Python2)

    target_reached = 0
    toss_num = 0
    while target_reached == 0:
        toss_num += 1
        random.shuffle(possible_tosses)
        this_toss = possible_tosses[0]
        #print([toss_num, this_toss])
        toss_sequence.append(this_toss)
        last_n_tosses = toss_sequence[-n:]
        #print(last_n_tosses)
        if last_n_tosses == target_seq:
            #print('Reached target at toss', toss_num)
            target_reached = 1

    seq_lengths_rec.append(toss_num)

print 'Average', sum(seq_lengths_rec) / len(seq_lengths_rec)
You don't re-initialize toss_sequence for each experiment, so you start every experiment with a pre-existing sequence of heads, having a 1 in 2 chance of hitting the target sequence on the first try of each new experiment.
Initializing toss_sequence inside the outer loop will solve your problem:
import random

# n is the target number of heads in a row
# Change the value of n, for different target heads-sequences
n = 4

possible_tosses = [ 'h', 't' ]
num_trials = 1000
target_seq = ['h' for i in range(0,n)]
seq_lengths_rec = []

for trial_num in range(0,num_trials):
    if (trial_num % 100) == 0:
        print('Trial num {} out of {}'.format(trial_num, num_trials))
        # (The free version of trinket.io uses Python2)

    target_reached = 0
    toss_num = 0
    toss_sequence = []
    while target_reached == 0:
        toss_num += 1
        random.shuffle(possible_tosses)
        this_toss = possible_tosses[0]
        #print([toss_num, this_toss])
        toss_sequence.append(this_toss)
        last_n_tosses = toss_sequence[-n:]
        #print(last_n_tosses)
        if last_n_tosses == target_seq:
            #print('Reached target at toss', toss_num)
            target_reached = 1

    seq_lengths_rec.append(toss_num)

print(sum(seq_lengths_rec) / len(seq_lengths_rec))
You can simplify your code a bit, and make it less error-prone:
import random

# n is the target number of heads in a row
# Change the value of n, for different target heads-sequences
n = 3

possible_tosses = [ 'h', 't' ]
num_trials = 1000
seq_lengths_rec = []

for trial_num in range(0, num_trials):
    if (trial_num % 100) == 0:
        print('Trial num {} out of {}'.format(trial_num, num_trials))
        # (The free version of trinket.io uses Python2)

    heads_counter = 0
    toss_counter = 0
    while heads_counter < n:
        toss_counter += 1
        this_toss = random.choice(possible_tosses)
        if this_toss == 'h':
            heads_counter += 1
        else:
            heads_counter = 0

    seq_lengths_rec.append(toss_counter)

print(sum(seq_lengths_rec) / len(seq_lengths_rec))
We can eliminate one additional loop by making each trial long enough (ideally infinite), e.g., tossing the coin n = 1000 times per trial. It is then likely that a run of 5 heads appears somewhere in each such trial. If it does appear, we call the trial an effective trial; otherwise we reject the trial.
In the end, we take the average of the number of tosses needed over the effective trials (by the law of large numbers it will approximate the expected number of tosses). Consider the following code:
import random

N = 100000  # total number of trials
n = 1000    # long enough sequence of tosses
k = 5       # k heads in a row

ntosses = []
pat = ''.join(['1']*k)
effective_trials = 0
for i in range(N):  # num of trials
    seq = ''.join(map(str, random.choices(range(2), k=n)))  # toss a coin n times (long enough)
    if pat in seq:
        ntosses.append(seq.index(pat) + k)
        effective_trials += 1

print(effective_trials, sum(ntosses) / effective_trials)
# 100000 62.19919
Notice that the result may not be correct if n is small, since the method approximates an infinite number of coin tosses per trial (to find the expected number of tosses to obtain 5 heads in a row, n = 1000 is okay, since the actual expected value is 62).
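For reference, the values quoted in the question follow the closed form E(k) = 2^(k+1) − 2 for a fair coin, which is what these simulations should converge to:
# Closed-form expected number of fair-coin tosses to see k heads in a row.
for k in range(1, 6):
    print(k, 2**(k + 1) - 2)
# 1 2
# 2 6
# 3 14
# 4 30
# 5 62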
I've written a function that saves all numbers between two digit lengths to a text file, with a step option to save some space and time, and I couldn't figure out how to show a percentage value, so I tried this:
for length in range(int(limit_min), int(limit_max) + 1):
    percent_quotient = 0
    j = 0
    while j <= (int(length * "9")):
        while len(str(j)) < length:
            j = "0" + str(j)
        percent_quotient += 1
        j = int(j) + int(step)  # increasing dummy variable

for length in range(int(limit_min), int(limit_max) + 1):
    counter = 1
    i = 0
    while i <= (int(length * "9")):
        while len(str(i)) < length:
            i = "0" + str(i)
        print "Writing %s to file. Progress: %.2f percent." % (str(i), (float(counter)/percent_quotient)*100)
        a.write(str(i) + "\n")  # this is where everything actually gets written
        i = int(i) + int(step)  # increasing i
        counter += 1
    if length != int(limit_max):
        print "Length %i done. Moving on to length of %i." % (length, length + 1)
    else:
        print "Length %i done." % (length)

a.close()  # closing file stream
print "All done. Closed file stream. New file size: %.2f megabytes." % (os.path.getsize(path) / float((1024 ** 2)))
print "Returning to main..."
What I tried to do here was make the program run through the iteration as many times as it normally would, but instead of writing to a file, the percent_quotient variable just counts how many iterations there are going to be. (I called j a dummy variable since it's only there to drive the loop; I'm sorry if there is a better term for this.) The second part is the actual work: I added a counter variable, and I divide it by percent_quotient and multiply by 100 to get a percentage.
The problem is that when I tried to build a dictionary from length 1 to length 8, it took a whole minute just to count everything. I imagine it would take much longer for an even bigger dictionary.
My question is, is there a better/faster way of doing this?
I can't really work out what this is doing. But it looks like it's doing roughly this:
a = file('d:/whatever.txt', 'wb')
limit_min = 1
limit_max = 5
step = 2
percent_quotient = (10 ** (limit_max - limit_min)) / step

for i in range(limit_min, 10**limit_max, step):
    output = str(i).zfill(limit_max) + '\r\n'
    a.write(output)
    if i % 100 < 2:
        print "Writing %s to file. Progress: %.2f percent." % (str(i), (float(i)/percent_quotient)*100)

a.close()
If that's right, then I suggest:
Do less code looping and more math.
Use str.zfill() instead of the while len(str(num)) < length: num = "0" + str(num) padding loop (see the sketch after this list).
Don't overwhelm the console by printing every single number; only print a status update every hundred numbers, or every thousand numbers, or so.
Do less str(int(str(int(str(int(str(int(...
Avoid "" + blah inside tight loops if possible; it causes strings to be rebuilt every time and is particularly slow.
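A small sketch of the zfill and counting suggestions above (the values of length and step are made up for illustration):
length = 4
step = 3

highest = int("9" * length)            # 9999 for length 4
how_many = highest // step + 1         # how many numbers the loop will visit, computed without looping
print("will write %d numbers of length %d" % (how_many, length))

print("17".zfill(length))              # '0017' -- replaces the manual '0' + str(num) padding loop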
Okay, the step variable is giving me a lot of headache, but without it, this would be the right way to calculate how many numbers are going to be written.
percent_quota = 0  # starting value
for i in range(limit_min, limit_max + 1):  # we make sure all lengths are covered
    percent_quota += (10**i) - 1  # we subtract 1 because for length of 2, max is 99
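If the step should be taken into account as well, the counting can stay arithmetic; a sketch under the assumption that each length starts counting at 0, as in the loops above:
percent_quota = 0
for i in range(limit_min, limit_max + 1):
    highest = int("9" * i)                # largest number of length i, e.g. 99 for length 2
    percent_quota += highest // step + 1  # numbers actually visited for this length with the given step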
TessellatingHeckler, thank you, your answer helped me figure this out!
My problem is as follows:
I have a file with a list of intervals:
1 5
2 8
9 12
20 30
And a range of
0 200
I would like to do an intersection of a kind that reports the positions [start end] lying between my intervals, inside the given range.
For example:
8 9
12 20
30 200
Besides any ideas on how to tackle this, it would also be nice to read some thoughts on optimization, since, as always, the input files are going to be huge.
This solution works as long as the intervals are ordered by their start point, and it does not require creating a list as big as the total range.
Code:
with open("0.txt") as f:
    t = [x.rstrip("\n").split("\t") for x in f.readlines()]
    intervals = [(int(x[0]), int(x[1])) for x in t]

def find_ints(intervals, mn, mx):
    next_start = mn
    for x in intervals:
        if next_start < x[0]:
            yield next_start, x[0]
            next_start = x[1]
        elif next_start < x[1]:
            next_start = x[1]
    if next_start < mx:
        yield next_start, mx

print list(find_ints(intervals, 0, 200))
output:
(in the case of the example you gave)
[(0, 1), (8, 9), (12, 20), (30, 200)]
Rough algorithm:
create an array of booleans, all set to false: seen = [False]*200
Iterate over the input file; for each line start end, set seen[start] .. seen[end] to True
Once done, then you can trivially walk the array to find the unused intervals.
In terms of optimisations, if the list of input ranges is sorted on start number, then you can track the highest seen number and use that to filter ranges as they are processed -
e.g. something like
for (start,end) in input:
    if end <= lowest_unseen:
        next
    if start < lowest_unseen:
        start = lowest_unseen
    ...
which (ignoring the cost of the original sort) should make the whole thing O(n) - you go through the array once to tag seen/unseen and once to output unseens.
Seems I'm feeling nice. Here is the (unoptimised) code, assuming your input file is called input
seen = [False]*200
file = open('input','r')
rows = file.readlines()
for row in rows:
    (start,end) = row.split(' ')
    print "%s %s" % (start,end)
    for x in range( int(start)-1, int(end)-1 ):
        seen[x] = True
print seen[0:10]

in_unseen_block = False
start = 1
for x in range(1,200):
    val = seen[x-1]
    if val and not in_unseen_block:
        continue
    if not val and in_unseen_block:
        continue
    # Must be at a change point.
    if val:
        # we have reached the end of the block
        print "%s %s" % (start,x)
        in_unseen_block = False
    else:
        # start of new block
        start = x
        in_unseen_block = True

# Handle end block
if in_unseen_block:
    print "%s %s" % (start, 200)
I'm leaving the optimizations as an exercise for the reader.
If you make a note every time one of your input intervals either opens or closes, you can put the keys of opens and closes together, sort them into an ordered set, and then essentially think: "okay, each adjacent pair of numbers forms an interval, so I can focus all of my logic on these intervals as discrete chunks."
myRange = range(201)
intervals = [(1,5), (2,8), (9,12), (20,30)]

opens = {}
closes = {}

def open(index):
    if index not in opens:
        opens[index] = 0
    opens[index] += 1

def close(index):
    if index not in closes:
        closes[index] = 0
    closes[index] += 1

for start, end in intervals:
    if end > start:  # Making sure to exclude empty intervals, which can be problematic later
        open(start)
        close(end)

# Sort all the interval-endpoints that we really need to look at
oset = {0: None, 200: None}
for k in opens.keys():
    oset[k] = None
for k in closes.keys():
    oset[k] = None
relevant_indices = sorted(oset.keys())

# Find the clear ranges
state = 0
results = []
for i in range(len(relevant_indices) - 1):
    start = relevant_indices[i]
    end = relevant_indices[i+1]
    start_state = state
    if start in opens:
        start_state += opens[start]
    if start in closes:
        start_state -= closes[start]
    end_state = start_state
    if end in opens:
        end_state += opens[end]
    if end in closes:
        end_state -= closes[end]
    state = end_state
    if start_state == 0:
        result_start = start
        result_end = end
        results.append((result_start, result_end))

for start, end in results:
    print(str(start) + " " + str(end))
This outputs:
0 1
8 9
12 20
30 200
The intervals don't need to be sorted.
This question seems to be a duplicate of Merging intervals in Python.
If I understood the problem correctly, you have a list of intervals (1 5; 2 8; 9 12; 20 30) and a range (0 200), and you want to get the positions that are outside your intervals but inside the given range. Right?
There's a Python library that can help you on that: python-intervals (also available from PyPI using pip). Disclaimer: I'm the maintainer of that library.
Assuming you import this library as follows:
import intervals as I
It's quite easy to get your answer. Basically, you first want to create a disjunction of intervals based on the ones you provide:
inters = I.closed(1, 5) | I.closed(2, 8) | I.closed(9, 12) | I.closed(20, 30)
Then you compute the complement of these intervals, to get everything that is "outside":
compl = ~inters
Then you intersect the result with [0, 200], as you want to restrict the points to that interval:
print(compl & I.closed(0, 200))
This results in:
[0,1) | (8,9) | (12,20) | (30,200]