A while loop time complexity - python

I'm interested in determining the big O time complexity of the following:
def f(x):
    r = x / 2
    d = 1e-10
    while abs(x - r**2) > d:
        r = (r + x/r) / 2
    return r
I believe this is O(log n). To arrive at this, I merely collected empirical data via the timeit module, plotted the results, and saw a plot that looked logarithmic, using the following code:
import timeit
import numpy as np
import matplotlib.pyplot as plt

ns = np.linspace(1, 50_000, 100, dtype=int)
ts = [timeit.timeit('f({})'.format(n),
                    number=100,
                    globals=globals())
      for n in ns]
plt.plot(ns, ts, 'or')
But this seems like a corny way to go about figuring this out. Intuitively, I understand that the body of the while loop involves dividing an expression by 2 some number k times until the while expression is equal to d. This repeated division by 2 gives something like 1/2^k, from which I can see where a log is involved to solve for k. I can't seem to write down a more explicit derivation, though. Any help?

This is Heron's (or Babylonian) method for calculating the square root of a number. https://en.wikipedia.org/wiki/Methods_of_computing_square_roots
Big O notation for this requires a numerical analysis approach. For more details on the analysis, see the Wikipedia page above, or look up the error convergence of Heron's method or fixed-point iteration (or see https://mathcirclesofchicago.org/wp-content/uploads/2015/08/johnson.pdf).
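In broad strokes, define the relative error e_n = r_n/sqrt(x) - 1, so that r_n = sqrt(x)*(1 + e_n). Substituting this into the update r_(n+1) = (r_n + x/r_n)/2 lets us write the error in terms of itself: e_(n+1) = (e_n**2)/(2*(e_n+1)).
Then we can see that e_(n+1) <= min{(e_n**2)/2, e_n/2}, so the error at least halves each iteration and, once it is small, decreases quadratically, with the number of accurate digits effectively doubling each iteration.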
What's different between this analysis and Big-O is that the time it takes does NOT depend on the size of the input, but instead on the wanted accuracy. So in terms of input, this while loop is O(1), because its number of iterations is bounded by the accuracy, not the input.
In terms of accuracy, taking the starting error to be at most 1, the error is bounded above by e_n <= 2**(-n), so we need to find n such that 2**(-n) < d, i.e. n > log_2(1/d). Assuming d < 1, n = floor(log_2(1/d)) + 1 would work. So in terms of d, it is O(log(1/d)).
EDIT: Some more info on error analysis of fixed point iteration http://www.maths.lth.se/na/courses/FMN050/media/material/part3_1.pdf
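The recurrence can be checked numerically. The sketch below is only an illustration: it assumes the error is measured as the relative error e_n = r_n/sqrt(x) - 1 (the definition under which e_(n+1) = e_n**2 / (2*(e_n+1)) holds), and the helper name `heron_errors` is made up for this example.

```python
import math

def heron_errors(x, steps):
    """Return the relative errors e_n = r_n / sqrt(x) - 1 of the first `steps` iterates.

    Assumption: the starting guess is r_0 = x / 2, as in the question's f(x).
    """
    root = math.sqrt(x)
    r = x / 2
    errors = []
    for _ in range(steps):
        errors.append(r / root - 1)
        r = (r + x / r) / 2  # one Heron step
    return errors

errors = heron_errors(100.0, 6)
for e_now, e_next in zip(errors, errors[1:]):
    # value predicted by the recurrence, to compare with the measured next error
    predicted = e_now**2 / (2 * (e_now + 1))
    print(f"{e_now:.3e} -> {e_next:.3e} (predicted {predicted:.3e})")
```

With x = 100 the starting error is 4, so the first few steps show the "roughly halving" regime before the quadratic regime takes over.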

I believe you're correct that it's O(log n).
Here you can see the successive values of r when x = 100000:
1 50000
2 25001
3 12502
4 6255
5 3136
6 1584
7 823
8 472
9 342
10 317
11 316
12 316
(I've rounded them off because the fractions are not interesting).
What you can see is that it goes through two phases.
Phase 1 is when r is large. During these first few iterations, x/r is tiny compared to r. As a result, r + x/r is close to r, so (r + x/r) / 2 is approximately r/2. You can see this in the first 8 iterations.
Phase 2 is when it gets close to the final result. During the last few iterations, x/r is close to r, so r + x/r is close to 2 * r, so (r + x/r) / 2 is close to r. At this point we're just improving the approximation by small amounts. These iterations are not really very dependent on the magnitude of x.
Here's the succession for x = 1000000 (10x the above):
1 500000
2 250001
3 125002
4 62505
5 31261
6 15646
7 7855
8 3991
9 2121
10 1296
11 1034
12 1001
13 1000
14 1000
This time there are 10 iterations in Phase 1, then we again have 4 iterations in Phase 2.
The complexity of the algorithm is dominated by Phase 1, which is logarithmic because it's approximately dividing by 2 each time.
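One way to see the two phases without timing noise is to count loop iterations directly. This is a rough sketch: `count_iterations` is a hypothetical helper that copies the loop from the question, and the `max_iter` cap is an added safety net (not in the original) in case floating-point rounding keeps `abs(x - r**2)` above `d` for very large x.

```python
def count_iterations(x, d=1e-10, max_iter=200):
    """Count how many times the question's while loop body runs for input x."""
    r = x / 2
    iterations = 0
    # max_iter is an assumption added here so the sketch always terminates
    while abs(x - r**2) > d and iterations < max_iter:
        r = (r + x / r) / 2
        iterations += 1
    return iterations

for x in (1e2, 1e4, 1e6, 1e8):
    print(int(x), count_iterations(x))
```

Each factor of 100 in x doubles sqrt(x) about 3.3 times, so Phase 1 should grow by a few iterations per step while Phase 2 stays roughly constant.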

Related

How to write pooling algorithm for lab work efficiency?

I was wondering if it makes sense to have an algorithm that calculates the best combinations of samples to create pools in order to analyse each sample.
e.g.
I have 5 plant populations with different sizes
data = {'pop':[1,2,3,4,5],
        'size':[23,45,65,31,43]}
The goal is to analyse each plant for one gene.
What I could do is analyse each plant individually, but that may involve too much labour.
Therefore, I was thinking of pooling populations in order to minimize the labour involved.
e.g. I could simply do pool1 = pop1,pop2,pop3 | pool2 = pop4,pop5
However, then I was thinking why not do pool1 = pop2,pop5, pool2 = pop1,pop3, and pool3 = pop4
So I was wondering if there is a way to calculate the optimal combination of populations or even plants (It is possible to split the populations in every desired way).
So when e.g. pool1 (pop1,pop2,pop3) is positive (desired gene found) then how to proceed in order to get to the individual plant that is positive, i.e. How to split the pool most effectively in order to get to identify the positive plants. It is likely that multiple plants of one population are positive
Overall I want to minimize the number of 'runs'
It is known that the expected frequency of positives is 0.036
I hope the idea is clear and somebody has ideas on how to do that
Thanks
If you have N plants, and the frequency of positives is 0.036, then the total amount of information you get is -N(0.036 log2 0.036 + 0.964 log2 0.964) = 0.224N bits. See https://en.wikipedia.org/wiki/Entropy_(information_theory)
Ideally, since each run gives you a binary answer, you'll want to get a full bit out of each one, or at least as close to it as possible (and you'll therefore run just under N/4 runs in total). You get a full bit when the probability of a positive result is 50%. That takes 19 plants, so do your initial runs on batches of 19 plants.
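The two numbers in this argument (about 0.224 bits per plant, and a batch size of 19) can be reproduced with a few lines. This is a small sketch; the function names are invented for illustration, and plants are assumed to be independent with the stated frequency of 0.036.

```python
import math

P_POSITIVE = 0.036  # expected frequency of positives, from the question

def entropy_bits(p):
    """Shannon entropy in bits of a single yes/no outcome with probability p."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def batch_size_for_half(p):
    """Smallest batch size whose pooled test is positive with probability >= 0.5."""
    size = 1
    while 1 - (1 - p) ** size < 0.5:
        size += 1
    return size

print(round(entropy_bits(P_POSITIVE), 3))   # bits of information per plant
print(batch_size_for_half(P_POSITIVE))      # plants per initial batch
```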
After that, you'll probably get close enough to optimal by dividing each batch into halves and testing each half.
The initial batches require N/19 runs.
Then you have N/19 batches of size 10 to test.
You'll have N/16 batches of size 5 to test
N/15 of size 2.5.
For the N/30 positive batches of size 2.5, test each plant.
Altogether, then, you have N(2/19+1/16+1/15+2.5/30) = 0.32N runs -- not too bad.
(note that @Stef's answer seems more efficient, but he got lucky in finding only 4 positives when 7 are expected :)
Let's try it:
import random
plants = [random.random() < 0.036 for _ in range(10000)]
nbuckets = len(plants)//19
buckets = [plants[i * len(plants)//nbuckets : (i+1) * len(plants)//nbuckets] for i in range(nbuckets)]
ntests = 0
def count_recursive(ar):
    global ntests
    if (len(ar)<=3):
        # run each plant
        ntests += len(ar)
        return ar.count(True)
    # run the batch
    ntests += 1
    if (ar.count(True) < 1):
        return 0
    mid = len(ar)//2
    return count_recursive(ar[:mid]) + count_recursive(ar[mid:])
print("Num plants: {}".format(len(plants)))
print("Num Positives: {}".format(plants.count(True)))
foundPositives = sum(count_recursive(b) for b in buckets)
print("Found positives: {} ".format(foundPositives))
print("Num tests: {}".format(ntests))
Results:
Num plants: 10000
Num Positives: 368
Found positives: 368
Num tests: 3310
Num plants: 10000
Num Positives: 325
Found positives: 325
Num tests: 3076
Num plants: 10000
Num Positives: 387
Found positives: 387
Num tests: 3526
Yup, as expected.
We can also do better by skipping a test when the result is guaranteed positive, because everything else in a positive batch tested negative. That optimization brings the total number of tests down to 0.26N -- quite close to optimal.
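A possible way to code that optimization is sketched below. It is not the code that produced the 0.26N figure: `count_tests` and its `known_positive` flag are hypothetical, and unlike `count_recursive` it keeps halving down to single plants instead of stopping at batches of 3, so its test counts may differ somewhat.

```python
import random

def count_tests(batch, known_positive=False):
    """Return (positives_found, tests_used) for one batch of booleans.

    known_positive=True means the batch is already known to contain a
    positive, so its pooled test can be skipped.
    """
    if len(batch) == 1:
        # a single plant that must be positive needs no test at all
        return (1, 0) if known_positive else (int(batch[0]), 1)
    tests = 0
    if not known_positive:
        tests += 1  # pooled test of the whole batch
        if not any(batch):
            return 0, tests
    mid = len(batch) // 2
    found_left, tests_left = count_tests(batch[:mid])
    # if the left half came back negative, the right half is guaranteed positive
    found_right, tests_right = count_tests(batch[mid:], known_positive=(found_left == 0))
    return found_left + found_right, tests + tests_left + tests_right

random.seed(0)  # fixed seed only so the sketch is repeatable
plants = [random.random() < 0.036 for _ in range(10000)]
batches = [plants[i:i + 19] for i in range(0, len(plants), 19)]
results = [count_tests(b) for b in batches]
found = sum(f for f, _ in results)
tests = sum(t for _, t in results)
print("positives:", plants.count(True), "found:", found, "tests:", tests)
```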
Since the original partition of plants into populations is irrelevant to the question, I'll ignore it.
Since the frequency of positives is very low, I think a simple dichotomy search should be efficient. Occasionally, we will run into the situation where we split a positive pool into two subpools and both subpools are positive, but since the frequency of positives is very low, we can hope that it won't happen too often.
import random
# random data
n = 23+45+65+31+43
data = [{'id': random.random(),
         'positive': random.choices([True, False], weights=[36, 1000-36])[0]
        } for _ in range(n)]
def test_pool(pool): # tests if a pool is positive
    # serious science in the lab happens here
    return any(d['positive'] for d in pool)
def get_positives(data):
    result = []
    nb_tests = 0
    pools = [data]
    while pools:
        pool = pools.pop()
        if len(pool) == 1:
            result.append(pool[0])
        else:
            for subpool in [pool[:len(pool)//2], pool[len(pool)//2:]]:
                nb_tests += 1
                if test_pool(subpool):
                    pools.append(subpool)
    return result, nb_tests
results, nb_tests = get_positives(data)
ground_truth = [d for d in data if d['positive']]
print('NUMBER OF TESTS: {}'.format(nb_tests))
print('FOUND POSITIVES:')
print([d['id'] for d in results])
print('GROUND TRUTH:')
print([d['id'] for d in ground_truth])
# NUMBER OF TESTS: 46
# FOUND POSITIVES:
# [0.2505629359502266, 0.46483641024238254, 0.8786751274491258, 0.250765592789725]
# GROUND TRUTH:
# [0.250765592789725, 0.8786751274491258, 0.46483641024238254, 0.2505629359502266]

Python Random Values with a Given Density

Say I want to build a maze with a certain probability of an obstacle at each position. This probability is determined by a density value ranging from 0 to 10, with 0 meaning "no chance", and 10 meaning "certain".
Does this Python code do what I want?
import random
obstacle_density = 10
if random.randint(0, 9) < obstacle_density:
    print("There is an obstacle")
I've tried various combinations of upper and lower bounds and inequalities, and this seems to do the job, but I'm suspicious. For one thing, there are 11 possible values for obstacle_density but only 10 possible results from random.randint(0, 9).
Not super sure about your solution. It seems like it would work, though.
Here's how I would approach it, even if it is a bit redundant - I'd start with a table just for my own reference:
density | probability of obstacle
---------------------------------
0 | 0%
1 | 10%
2 | 20%
3 | 30%
4 | 40%
5 | 50%
6 | 60%
7 | 70%
8 | 80%
9 | 90%
10 | 100%
This seems to add up. I present two versions of a function which returns True or False depending on the density. In the first version, I use the density to create the associated weights to be used with random.choices (the total weight in this case would be 100). For example, if density = 3, then weights = [30, 70] - 30% to be True, 70% to be False.
def get_obstacle_state_version_1(density):
    from random import choices
    assert isinstance(density, int)
    assert density in range(0, 11) # 0 - 10 inclusive
    true_weight = density * 10
    false_weight = 100 - true_weight
    weights = [true_weight, false_weight]
    return choices([True, False], weights=weights, k=1)[0]
Here's the second version, in which I use random.choice rather than random.choices. The latter always returns a list of samples, even if the sample size k is 1.
Here, the idea is the same, but basically the density influences the number of Trues and Falses that appear in the population to be sampled. For example, if density = 3, then random.choice would pick one element from a list of 30 Trues, and 70 Falses with a uniform distribution.
def get_obstacle_state_version_2(density):
    from random import choice
    assert isinstance(density, int)
    assert density in range(0, 11) # 0 - 10 inclusive
    true_count = density * 10
    false_count = 100 - true_count
    return choice([True] * true_count + [False] * false_count)
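The density table can also be checked against the question's own `random.randint(0, 9) < obstacle_density` test without sampling, by enumerating the ten equally likely results. This is a small illustrative sketch; `obstacle_probability` is a name made up here.

```python
def obstacle_probability(density):
    """Exact probability that random.randint(0, 9) < density, by enumeration."""
    outcomes = range(0, 10)  # the ten equally likely values of randint(0, 9)
    hits = sum(1 for value in outcomes if value < density)
    return hits / len(outcomes)

for density in range(0, 11):
    print(density, obstacle_probability(density))
```

The 11 densities map onto the 11 possible hit counts (0 through 10 out of 10), which is why 11 density values and 10 random results are not in conflict.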
You should loop over the maze and at each site assign a probability.
You should do something like this:
probability = random.randint(0, 10) / 10
I have no idea what you mean by obstacle_density, so I am not gonna go there.

Google CodeJam Past Exercise - Decrease runtime

I have been working on a past Google Codejam algorithm from 2010, but the time complexity is awful.
Here is the question from Google Codejam: https://code.google.com/codejam/contest/619102/dashboard
TLDR - Imagine two towers, each with a number line running up its side. We draw a line from a point on one building's number line (say 10) to a point on the other building's number line (say 1). If we do this n times, how many times will those lines intersect?
I was wondering if anyone here is able to suggest a way in which I can speed up my algorithm? After 4 hours I really can't see one and I'm losing my miinnnnddd.
Here is my code as of right now.
An example input would be:
2 - (Number of cases)
3 - (Number of wires in case # 1)
1 10
5 5
7 7
Case #1: 2 - (2 intersections among lines 1,10 5,5 7,7)
2 - (Number of wires in case #2)
5 5
2 2
Case #2: 0 - (No lines intersect)
def solve(wire_ints, test_case):
    answer_integer = 0
    for iterI in range(number_wires):
        for iterJ in range(iterI):
            holder = [wire_ints[iterI], wire_ints[iterJ]]
            holder.sort()
            if holder[0][1] > holder[1][1]:
                answer_integer = answer_integer + 1
    return("Case #" + str(test_case) + ":" + " " + str(answer_integer))
for test_case in range(1, int(input()) + 1):
    number_wires = int(input())
    wire_ints = []
    for count1 in range(number_wires):
        left_port,right_port = map(int, input().split())
        wire_ints.append((left_port,right_port))
    answer_string = solve(wire_ints, test_case)
    print(answer_string)
This algorithm does WORK for any input I give it, but as I said it's very ugly and slow.
Help would be appreciated!
Since N is 1000, an O(N^2) algorithm would be acceptable. So what you have to do is sort the wires by one of their end points.
//sorted by first number
1 10
5 5
7 7
Then you process each line from the beginning and check whether it intersects any of the lines before it. If the second end point of an earlier line is bigger than the second end point of the current line, they intersect. This requires two loops, hence the O(N^2) complexity, which suffices for N=1000. You can also interpret this as an inversion count: you have to count the number of inversions among the second end points when the list is sorted by first end point.
10 5 7 -> number of inversions is 2, because of (10,5) and (10,7)
There is also an O(NlogN) approach to counting inversions, which you don't need for this question.
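For reference, the O(NlogN) inversion count is usually done with a merge sort that counts how many left-half elements each right-half element jumps over. The sketch below is one such version, not code from the contest; `count_inversions` and `count_intersections` are names chosen for this example, and wires are assumed to be (left, right) tuples with distinct end points.

```python
def count_inversions(values):
    """Return (sorted_values, inversions) using merge sort."""
    if len(values) <= 1:
        return list(values), 0
    mid = len(values) // 2
    left, inv_left = count_inversions(values[:mid])
    right, inv_right = count_inversions(values[mid:])
    merged = []
    inversions = inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] is smaller than every remaining element of left
            merged.append(right[j])
            j += 1
            inversions += len(left) - i
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions

def count_intersections(wires):
    """Sort wires by left end point, then count inversions of the right end points."""
    right_ends = [right for _, right in sorted(wires)]
    return count_inversions(right_ends)[1]

print(count_intersections([(1, 10), (5, 5), (7, 7)]))
print(count_intersections([(5, 5), (2, 2)]))
```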

Algorithm for distributing tasks to two printers?

I am doing a programming exercise online, and I found this question:
Two printers work at different speeds. The first printer produces one paper in x minutes, while the second does it in y minutes. To print N papers in total, how should the tasks be distributed between the printers so that the printing time is minimal?
The exercise gives me three inputs x,y,N and asks for the minimum time as output.
input data:
1 1 5
3 5 4
answer:
3 9
I have tried to set tasks for first printer as a, and the tasks for the second printer as N-a. The most efficient way to print is to let them have the same time, so the minimum time would be ((n*b)/(a+b))+1. However this formula is wrong.
Then I tried to use a brute force way to solve this problem. I first distinguished which one is smaller (faster) in a and b. Then I keep adding one paper to the faster printer. When the time needed for that printer is longer than the time to print one paper of the other printer, I give one paper to the slower printer, and subtract the time of faster printer.
The code is like:
def fastest_time (a, b, n):
    """ Return the smallest time when keeping two machines working at the same time.
    The parameters a and b should each be a float/integer referring to the
    productivities of the two machines. n should be an int, referring to the total
    number of tasks. Return an int standing for the minimal time needed."""
    # Assign the one-paper-time in terms of the magnitude of it, the reason
    # for doing that is my algorithm is counting along the faster printer.
    if a > b:
        slower_time_each = a
        faster_time_each = b
    elif a < b :
        slower_time_each = b
        faster_time_each = a
    # If a and b are the same, then we just run the formula as one printer
    else :
        return (a * n) / 2 + 1
    faster_paper = 0
    faster_time = 0
    slower_paper = 0
    # Loop until the total papers satisfy the total task
    while faster_paper + slower_paper < n:
        # We keep adding one task to the faster printer
        faster_time += 1 * faster_time_each
        faster_paper += 1
        # If the time is exceeding the time needed for the slower machine,
        # we then assign one task to it
        if faster_time >= slower_time_each:
            slower_paper += 1
            faster_time -= 1 * slower_time_each
    # Return the total time needed
    return faster_paper * faster_time_each
It works when N is small or x and y are big, but it needs a lot of time (more than 10 minutes I guess) to compute when x and y are very small, i.e. the input is 1 2 159958878.
I believe there is a better algorithm to solve this problem. Can anyone give me some suggestions or hints, please?
Given the input in form
x, y, n = 1, 2, 159958878
this should work
import math
math.ceil((max((x,y)) / float(x+y)) * n) * min((x,y))
This works for all your sample inputs.
In [61]: x, y, n = 1,1,5
In [62]: math.ceil((max((x,y)) / float(x+y)) * n) * min((x,y))
Out[62]: 3.0
In [63]: x, y, n = 3,5,4
In [64]: math.ceil((max((x,y)) / float(x+y)) * n) * min((x,y))
Out[64]: 9.0
In [65]: x, y, n = 1,2,159958878
In [66]: math.ceil((max((x,y)) / float(x+y)) * n) * min((x,y))
Out[66]: 106639252.0
EDIT:
This does not work for the case mentioned by @Antti, i.e. x, y, n = 4,7,2.
The reason is that we only consider the smaller time first. So the solution is to compute both values, one considering the smaller time and one considering the larger time, and then choose whichever result is smaller.
So, this works for all the cases, including @Antti's:
min((math.ceil((max((x,y)) / float(x+y)) * n) * min((x,y)),
     math.ceil((min((x,y)) / float(x+y)) * n) * max((x,y))))
Although there might be some extreme cases where you might have to change it a little bit.
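One likely source of such extreme cases is floating-point rounding inside `math.ceil` for large inputs. A possible integer-only variant of the same two-candidate idea is sketched below, together with a slow brute-force check for small inputs. Both function names are made up for this example, and x, y, n are assumed to be positive integers.

```python
def min_print_time(x, y, n):
    """Minimum time to print n papers, using integer ceiling division only."""
    total = x + y
    # candidate 1: the x-printer takes ceil(n*y/(x+y)) papers and finishes last
    pages_on_x = -(-n * y // total)
    # candidate 2: the y-printer takes ceil(n*x/(x+y)) papers and finishes last
    pages_on_y = -(-n * x // total)
    return min(pages_on_x * x, pages_on_y * y)

def min_print_time_brute(x, y, n):
    """Smallest t with t//x + t//y >= n; only practical for small inputs."""
    t = 0
    while t // x + t // y < n:
        t += 1
    return t

for x, y, n in [(1, 1, 5), (3, 5, 4), (4, 7, 2), (1, 2, 159958878)]:
    print(x, y, n, min_print_time(x, y, n))
```

The brute-force function is only there to cross-check the closed form on small cases; it would be far too slow for inputs like 1 2 159958878.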

Poisson simulation not working as expected?

I have a simple script to set up a Poisson distribution by constructing an array of "events" of probability = 0.1, and then counting the number of successes in each group of 10. It almost works, but the distribution is not quite right (P(0) should equal P(1), but is instead about 90% of P(1)). It's like there's an off-by-one kind of error, but I can't figure out what it is. The script uses the Counter class from here (because I have Python 2.6 and not 2.7) and the grouping uses itertools as discussed here. It's not a stochastic issue, repeats give pretty tight results, and the overall mean looks good, group size looks good. Any ideas where I've messed up?
from itertools import izip_longest
import numpy as np
import Counter
def groups(iterable, n=3, padvalue=0):
    "groups('abcde', 3, 'x') --> ('a','b','c'), ('d','e','x')"
    return izip_longest(*[iter(iterable)]*n, fillvalue=padvalue)
def event():
    f = 0.1
    r = np.random.random()
    if r < f: return 1
    return 0
L = [event() for i in range(100000)]
rL = [sum(g) for g in groups(L,n=10)]
print len(rL)
print sum(list(L))
C = Counter.Counter(rL)
for i in range(max(C.keys())+1):
    print str(i).rjust(2), C[i]
$ python script.py
10000
9949
0 3509
1 3845
2 1971
3 555
4 104
5 15
6 1
$ python script.py
10000
10152
0 3417
1 3879
2 1978
3 599
4 115
5 12
I did a combinatorial reality check on your math, and it looks like your results are actually correct. P(0) should not be roughly equivalent to P(1).
.9^10 = 0.34867844 = probability of 0 events
.1 * .9^9 * (10 choose 1) = .1 * .9^9 * 10 = 0.387420489 = probability of 1 event
I wonder if you accidentally did your math thusly:
.1 * .9^10 * (10 choose 1) = 0.34867844 = incorrect probability of 1 event
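The same check can be run in code. Counting successes in groups of 10 events with probability 0.1 gives a binomial distribution, which only approaches a Poisson distribution with mean 1 (where P(0) = P(1)) as the group size grows and the per-event probability shrinks. The sketch below compares the two; it assumes Python 3.8+ for `math.comb` (the question's script is Python 2), and the function names are chosen for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Probability of exactly k events for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

for k in range(4):
    print(k, round(binomial_pmf(k, 10, 0.1), 4), round(poisson_pmf(k, 1.0), 4))
```

For the binomial case P(0)/P(1) is exactly 0.9, which matches the "about 90% of P(1)" seen in the script's output.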
