Efficiently Find Median of an Unordered Set of Data


Background
I was looking into the statistics.median() function (link) in the Standard Library and decided to see how it was implemented in the source code. To my surprise, the median is calculated by sorting the entire data set and returning the "middle value".
Example:
def median(data):
    data = sorted(data)
    n = len(data)
    if n == 0:
        raise StatisticsError("no median for empty data")
    if n % 2 == 1:
        return data[n // 2]
    i = n // 2
    return (data[i - 1] + data[i]) / 2
This is fine for small data sets, but sorting costs O(n log n), which can be expensive for large ones; a selection algorithm can find the median in O(n) time on average.
So, I went through numerous sources and decided that the algorithm designed by Floyd and Rivest (link) would be the best for finding the median. Some of the other algorithms I saw are:
Quickselect
Introselect
I chose the Floyd-Rivest algorithm because it runs in linear time on average with a small constant factor and seems resistant to adversarial inputs such as the median-of-3 killer sequence.
Floyd-Rivest Algorithm
Python 3.10 with type hints
from collections.abc import Iterable, Sequence
from math import (
    exp,
    log,
    sqrt)

def sign(value: int | float) -> int:
    # Returns 1, 0, or -1 depending on the sign of value.
    return bool(value > 0) - bool(value < 0)

def swap(sequence: list[int | float], x: int, y: int) -> None:
    sequence[x], sequence[y] = sequence[y], sequence[x]
    return

def floyd_rivest(sequence: list[int | float], left: int, right: int, k: int) -> int | float:
    while right > left:
        if right - left > 600:
            # Recurse on a small sample around k to get a good pivot into position k.
            n: int = right - left + 1
            i: int = k - left + 1
            z: float = log(n)
            s: float = 0.5 * exp(2 * z / 3)
            sd: float = 0.5 * sqrt(z * s * (n - s) / n) * sign(i - n / 2)
            new_left: int = max((left, int(k - i * s / n + sd)))
            new_right: int = min((right, int(k + (n - i) * s / n + sd)))
            floyd_rivest(sequence, new_left, new_right, k)
        # Partition sequence[left..right] around the pivot t.
        t: int | float = sequence[k]
        sliding_left: int = left
        sliding_right: int = right
        swap(sequence, left, k)
        if sequence[right] > t:
            swap(sequence, left, right)
        while sliding_left < sliding_right:
            swap(sequence, sliding_left, sliding_right)
            sliding_left += 1
            sliding_right -= 1
            while sequence[sliding_left] < t:
                sliding_left += 1
            while sequence[sliding_right] > t:
                sliding_right -= 1
        if sequence[left] == t:
            swap(sequence, left, sliding_right)
        else:
            sliding_right += 1
            swap(sequence, right, sliding_right)
        # Narrow the search range to the side that contains index k.
        if sliding_right <= k:
            left = sliding_right + 1
        if k <= sliding_right:
            right = sliding_right - 1
    return sequence[k]

def median(data: Iterable[int | float] | Sequence[int | float]) -> int | float:
    sequence: list[int | float] = list(data)
    length: int = len(sequence)
    end: int = length - 1
    midpoint: int = end // 2
    if length % 2 == 1:
        return floyd_rivest(sequence, 0, end, midpoint)
    return (floyd_rivest(sequence, 0, end, midpoint) + floyd_rivest(sequence, 0, end, midpoint + 1)) / 2
Question
Apparently, the Floyd-Rivest algorithm does not perform as well with nondistinct data, for example, a list containing multiple 1s: [1, 1, 1, 1, 2, 3, 4, 5]. However, this was studied and seemingly solved by a person named Krzysztof C. Kiwiel who wrote a paper titled, "On Floyd and Rivest's SELECT algorithm". They modified the algorithm to perform better with nondistinct data.
My question is how would I implement/code Kiwiel's modified Floyd-Rivest algorithm?
Extra Considerations
In Kiwiel's paper, they also mention a non-recursive version of the algorithm. If you feel inclined, it would be nice to have an iterative algorithm to prevent overflowing stack frames (deep recursion). I am aware that a stack can be mimicked, but if you can find a way to rewrite the algorithm in such a way that it is elegantly written iteratively, that would be preferable.
Lastly, any input on speeding up the algorithm or using alternative ("better") algorithms is welcome! (I know Numpy has a median function and I know languages such as C would be more performant, but I am looking to find the "best" algorithm logic and not just generically making it faster)
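As one possible starting point for the "alternative algorithms" part, here is a minimal sketch of plain Quickselect with a random pivot and a three-way split. To be clear, this is not Kiwiel's modified SELECT and not the Floyd-Rivest sampling scheme; the names quickselect and median_by_selection are made up for illustration. It is iterative (no recursion) and groups all values equal to the pivot together, so repeated values such as [1, 1, 1, 1, 2, 3, 4, 5] do not degrade it.

```python
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-based) of values."""
    items = list(values)
    if not 0 <= k < len(items):
        raise IndexError("k is out of range")
    while True:
        pivot = random.choice(items)
        # Three-way split: everything equal to the pivot is handled at once.
        lows = [v for v in items if v < pivot]
        highs = [v for v in items if v > pivot]
        n_equal = len(items) - len(lows) - len(highs)
        if k < len(lows):
            items = lows
        elif k < len(lows) + n_equal:
            return pivot
        else:
            k -= len(lows) + n_equal
            items = highs

def median_by_selection(data):
    """Median via two selections for even lengths, one for odd lengths."""
    items = list(data)
    n = len(items)
    if n == 0:
        raise ValueError("no median for empty data")
    upper = quickselect(items, n // 2)
    if n % 2 == 1:
        return upper
    lower = quickselect(items, n // 2 - 1)
    return (lower + upper) / 2

print(median_by_selection([1, 1, 1, 1, 2, 3, 4, 5]))
```

The list comprehensions cost extra memory compared with an in-place partition; they are used here only to keep the sketch short and easy to check.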

Related

A Normal Distribution Calculator

I'm trying to write a program that solves various normal distribution questions in pure Python (no modules other than math), to 4 decimal places only, for A Levels. A problem occurs in the call get_z_less_than_a_equal(0.75). Apparently, without the assert statement in the except clause, the variables all get messed up and change. The error I'm catching is a RecursionError. If there is an easier and more efficient way to do things, I'd appreciate it.
import math

mean = 0
standard_dev = 1
percentage_points = {0.5000: 0.0000, 0.4000: 0.2533, 0.3000: 0.5244, 0.2000: 0.8416, 0.1000: 1.2816, 0.0500: 1.6440, 0.0250: 1.9600, 0.0100: 2.3263, 0.0050: 2.5758, 0.0010: 3.0902, 0.0005: 3.2905}

def get_z_less_than(x):
    """
    P(Z < x)
    """
    return round(0.5 * (1 + math.erf((x - mean)/math.sqrt(2 * standard_dev**2))), 4)

def get_z_greater_than(x):
    """
    P(Z > x)
    """
    return round(1 - get_z_less_than(x), 4)

def get_z_in_range(lower_bound, upper_bound):
    """
    P(lower_bound < Z < upper_bound)
    """
    return round(get_z_less_than(upper_bound) - get_z_less_than(lower_bound), 4)

def get_z_less_than_a_equal(x):
    """
    P(Z < a) = x
    acquires a, given x
    """
    # first trial: brute forcing
    for i in range(401):
        a = i/100
        p = get_z_less_than(a)
        if x == p:
            return a
        elif p > x:
            break
    # second trial: using symmetry
    try:
        res = -get_z_less_than_a_equal(1 - x)
    except:
        # third trial: using estimation
        assert a, "error"
        prev = get_z_less_than(a-0.01)
        p = get_z_less_than(a)
        if abs(x - prev) > abs(x - p):
            res = a
        else:
            res = a - 0.01
    return res

def get_z_greater_than_a_equal(x):
    """
    P(Z > a) = x
    """
    if x in percentage_points:
        return percentage_points[x]
    else:
        return get_z_less_than_a_equal(1-x)

print(get_z_in_range(-1.20, 1.40))
print(get_z_less_than_a_equal(0.7517))
print(get_z_greater_than_a_equal(0.1000))
print(get_z_greater_than_a_equal(0.0322))
print(get_z_less_than_a_equal(0.1075))
print(get_z_less_than_a_equal(0.75))

Since Python 3.8, the statistics module in the standard library has a NormalDist class, so we could use that to implement our functions "with pure python", or at least for testing:
import math
from statistics import NormalDist

normal_dist = NormalDist(mu=0, sigma=1)
for i in range(-2000, 2000):
    test_val = i / 1000
    assert get_z_less_than(test_val) == round(normal_dist.cdf(test_val), 4)
This doesn't throw an error, so that part probably works fine.
Your get_z_less_than_a_equal seems to be the equivalent of NormalDist.inv_cdf.
There are very efficient ways to compute it accurately using the inverse of the error function (see Wikipedia and Python implementation), but we don't have that in the standard library.
Since you only care about the first few digits and get_z_less_than is monotonic, we can use a simple bisection method to find our solution.
Newton's method would be much faster, and not too hard to implement since we know that the derivative of the cdf is just the pdf, but it is still probably more complex than what we need.
def get_z_less_than_a_equal(x):
    """
    P(Z < a) = x
    acquires a, given x
    """
    if x <= 0.0 or x >= 1.0:
        raise ValueError("x must be >0.0 and <1.0")
    min_res, max_res = -10, 10
    while max_res - min_res > 1e-7:
        mid = (max_res + min_res) / 2
        if get_z_less_than(mid) < x:
            min_res = mid
        else:
            max_res = mid
    return round((max_res + min_res) / 2, 4)
Let's test this:
for i in range(1, 2000):
    test_val = i / 2000
    left_val = get_z_less_than_a_equal(test_val)
    right_val = round(normal_dist.inv_cdf(test_val), 4)
    assert left_val == right_val, f"{left_val} != {right_val}"
# AssertionError: -3.3201 != -3.2905
We see that we are losing some precision. That's because the error introduced by get_z_less_than (which rounds to 4 digits) gets propagated and amplified when we use it to estimate its inverse (see Wikipedia - error propagation for details).
So let's add a "digits" parameter to get_z_less_than and change our functions slightly:
def get_z_less_than(x, digits=4):
    """
    P(Z < x)
    """
    res = 0.5 * (1 + math.erf((x - mean) / math.sqrt(2 * standard_dev ** 2)))
    return round(res, digits)

def get_z_less_than_a_equal(x, digits=4):
    """
    P(Z < a) = x
    acquires a, given x
    """
    if x <= 0.0 or x >= 1.0:
        raise ValueError("x must be >0.0 and <1.0")
    min_res, max_res = -10, 10
    while max_res - min_res > 10 ** -(digits * 2):
        mid = (max_res + min_res) / 2
        if get_z_less_than(mid, digits * 2) < x:
            min_res = mid
        else:
            max_res = mid
    return round((max_res + min_res) / 2, digits)
And now we can try the same test again and see it passes
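For comparison, the Newton's method alternative mentioned earlier in this answer could look roughly like the sketch below. It assumes a standard normal distribution (mean 0, standard deviation 1) and works on unrounded values; the names normal_cdf, normal_pdf and inverse_cdf_newton are invented for this example. Starting from z = 0, each step moves by (cdf(z) - p) / pdf(z).

```python
import math

def normal_cdf(z):
    """P(Z < z) for the standard normal distribution, unrounded."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_pdf(z):
    """Density of the standard normal distribution (derivative of the cdf)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def inverse_cdf_newton(p, tolerance=1e-12, max_iterations=100):
    """Find z with P(Z < z) = p using Newton's method, starting at z = 0."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be >0.0 and <1.0")
    z = 0.0
    for _ in range(max_iterations):
        step = (normal_cdf(z) - p) / normal_pdf(z)
        z -= step
        if abs(step) < tolerance:
            break
    return z

print(round(inverse_cdf_newton(0.975), 4))
```

Rounding only the final result, rather than the intermediate cdf values, avoids the precision loss seen with the first bisection attempt.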

Find the substring avoiding the use of recursive function

I am studying algorithms in Python and solving a question that is:
Let x(k) be a recursively defined string with base case x(1) = "123"
and x(k) is "1" + x(k-1) + "2" + x(k-1) + "3". Given three positive
integers k,s, and t, find the substring x(k)[s:t].
For example, if k = 2, s = 1 and t = 5,x(2) = 112321233 and x(2)[1:5]
= 1232.
I have solved it using a simple recursive function:
def generate_string(k):
    if k == 1:
        return "123"
    part = generate_string(k - 1)
    return ("1" + part + "2" + part + "3")

print(generate_string(k)[s:t])
Although my first approach gives the correct answer, the problem is that it takes too long to build the string x when k is greater than 20. The program needs to finish within 16 seconds for k below 50. I have tried memoization, but it does not help because I am not allowed to cache across test cases. I therefore think I must avoid the recursive function to speed up the program. Are there any approaches I should consider?

We can see that the string represented by x(k) grows exponentially in length with increasing k:
len(x(1)) == 3
len(x(k)) == len(x(k-1)) * 2 + 3
So:
len(x(k)) == 3 * (2**k - 1)
For k equal to 100, this amounts to a length of more than 10^30. That's more characters than there are atoms in a human body!
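As a quick sanity check of the closed form, the small sketch below compares it against the recurrence for a range of k. The function names are made up for this illustration.

```python
def x_length_by_recurrence(k):
    """Length of x(k) computed from len(x(k)) = 2 * len(x(k-1)) + 3."""
    length = 3  # len(x(1))
    for _ in range(k - 1):
        length = 2 * length + 3
    return length

def x_length_closed_form(k):
    """Length of x(k) from the closed form 3 * (2**k - 1)."""
    return 3 * (2**k - 1)

# The two agree for every k checked here.
for k in range(1, 101):
    assert x_length_by_recurrence(k) == x_length_closed_form(k)

print(x_length_closed_form(100) > 10**30)
```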
Since the parameters s and t will take (in comparison) a tiny, tiny slice of that, you should not need to produce the whole string. You can still use recursion though, but keep passing an s and t range to each call. Then when you see that this slice will actually be outside of the string you would generate, then you can just exit without recursing deeper, saving a lot of time and (string) space.
Here is how you could do it:
def getslice(k, s, t):
    def recur(xsize, s, t):
        if xsize == 0 or s >= xsize or t <= 0:
            return ""
        smaller = (xsize - 3) // 2
        return ( ("1" if s <= 0 else "")
               + recur(smaller, s-1, t-1)
               + ("2" if s <= smaller+1 < t else "")
               + recur(smaller, s-smaller-2, t-smaller-2)
               + ("3" if t >= xsize else "") )
    return recur(3 * (2**k - 1), s, t)
This doesn't use any caching of x(k) results... In my tests this was fast enough.

Based on @FMc's answer, here's some python3 code that calculates x(k, s, t):
from functools import lru_cache
from typing import List

def f_len(k) -> int:
    return 3 * ((2 ** k) - 1)

@lru_cache(None)
def f(k) -> str:
    if k == 1:
        return "123"
    return "1" + f(k - 1) + "2" + f(k - 1) + "3"

def substring_(k, s, t, output) -> None:
    # Empty substring.
    if s >= t or k == 0:
        return
    # (An optimization):
    # If all the characters need to be included, just calculate the string and cache it.
    if s == 0 and t == f_len(k):
        output.append(f(k))
        return
    if s == 0:
        output.append("1")
    sub_len = f_len(k - 1)
    substring_(k - 1, max(0, s - 1), min(sub_len, t - 1), output)
    if s <= 1 + sub_len < t:
        output.append("2")
    substring_(k - 1, max(0, s - sub_len - 2), min(sub_len, t - sub_len - 2), output)
    if s <= 2 * (1 + sub_len) < t:
        output.append("3")

def substring(k, s, t) -> str:
    output: List[str] = []
    substring_(k, s, t, output)
    return "".join(output)

def test(k, s, t) -> bool:
    actual = substring(k, s, t)
    expected = f(k)[s:t]
    return actual == expected

assert test(1, 0, 3)
assert test(2, 2, 6)
assert test(2, 1, 5)
assert test(2, 0, f_len(2))
assert test(3, 0, f_len(3))
assert test(8, 44, 89)
assert test(10, 1001, 2022)
assert test(14, 12345, 45678)
assert test(17, 12345, 112345)
# print(substring(30, 10000, 10100))
print("Tests passed")

This is an interesting problem. I'm not sure whether I'll have time to write the code, but here's an outline of how you can solve it. Note: see the better answer from trincot.
As discussed in the comments, you cannot generate the actual string: you will quickly run out of memory as k grows. But you can easily compute the length of that string.
First some notation:
f(k) : The generated string.
n(k) : The length of f(k).
nk1 : n(k-1), which is used several times in table below.
For discussion purposes, we can divide the string into the following regions. The start/end values use standard Python slice numbering:
Region | Start         | End           | Len | Substring | Ex: k = 2
--------------------------------------------------------------------
A      | 0             | 1             | 1   | 1         | 0:1  1
B      | 1             | 1 + nk1       | nk1 | f(k-1)    | 1:4  123
C      | 1 + nk1       | 2 + nk1       | 1   | 2         | 4:5  2
D      | 2 + nk1       | 2 + nk1 + nk1 | nk1 | f(k-1)    | 5:8  123
E      | 2 + nk1 + nk1 | 3 + nk1 + nk1 | 1   | 3         | 8:9  3
Given k, s, and t we need to figure out which region of the string is relevant. Take a small example:
k=2, s=6, and t=8.
The substring defined by 6:8 does not require the full f(k). We only need
region D, so we can turn our attention to f(k-1).
To make the shift from k=2 to k=1, we need to adjust s and t: specifically,
we need to subtract the total length of regions A + B + C. For k=2, that
length is 5 (1 + nk1 + 1).
Now we are dealing with: k=1, s=1, and t=3.
Repeat as needed.
Whenever k gets small enough, we stop this nonsense and actually generate the string so we can grab the needed substring directly.
It's possible that some values of s and t could cross region boundaries. In that case, divide the problem into two subparts (one for each region needed). But the general idea is the same.
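The region table can be turned into code in several ways. The sketch below is one simplified variant, not the exact divide-into-subparts scheme outlined above: it looks up each character of the slice separately by walking down through regions A to E, which sidesteps the boundary-crossing case and needs no recursion. It assumes 0 <= s <= t, and the names x_length, char_at and substring are chosen only for this illustration.

```python
def x_length(k):
    """Length n(k) of the string f(k)."""
    return 3 * (2**k - 1)

def char_at(k, i):
    """Character at index i of f(k), found by descending through the regions."""
    while k > 1:
        nk1 = x_length(k - 1)
        if i == 0:                # region A
            return "1"
        if i <= nk1:              # region B: drop region A, continue in f(k-1)
            i -= 1
        elif i == nk1 + 1:        # region C
            return "2"
        elif i <= 2 * nk1 + 1:    # region D: drop regions A + B + C
            i -= nk1 + 2
        else:                     # region E
            return "3"
        k -= 1
    return "123"[i]

def substring(k, s, t):
    """Equivalent of f(k)[s:t] without building f(k); assumes 0 <= s <= t."""
    stop = min(t, x_length(k))
    return "".join(char_at(k, i) for i in range(s, stop))

print(substring(2, 1, 5))
print(substring(50, 10000, 10050))
```

Each character costs at most k steps, so the total work is proportional to k times the slice length rather than to the length of f(k).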

Here's a commented iterative version in JavaScript that's very easy to convert to Python.
In addition to being what you asked for, that is non-recursive, it allows us to solve things like f(10000, 10000, 10050), which seem to exceed Python default recursion depth.
// Generates the full string
function g(k){
  if (k == 1)
    return "123";
  const prev = g(k - 1);
  return "1" + prev + "2" + prev + "3";
}

function size(k){
  return 3 * ((1 << k) - 1);
}

// Given a depth and index,
// we'd like (1) a string to
// output, (2) the possible next
// part of the same depth to
// push to the stack, and (3)
// possibly the current section
// mapped deeper to also push to
// the stack. (2) and (3) can be
// in a single list.
function getParams(depth, i){
  const psize = size(depth - 1);
  if (i == 0){
    return ["1", [[depth, 1 + psize], [depth - 1, 0]]];
  } else if (i < 1 + psize){
    return ["", [[depth, 1 + psize], [depth - 1, i - 1]]];
  } else if (i == 1 + psize){
    return ["2", [[depth, 2 + 2 * psize], [depth - 1, 0]]];
  } else if (i < 2 + 2 * psize){
    return ["", [[depth, 2 + 2 * psize], [depth - 1, i - 2 - psize]]];
  } else {
    return ["3", []];
  }
}

function f(k, s, t){
  let len = t - s;
  let str = "";
  let stack = [[k, s]];
  while (str.length < len){
    const [depth, i] = stack.pop();
    if (depth == 1){
      const toTake = Math.min(3 - i, len - str.length);
      str = str + "123".substr(i, toTake);
    } else {
      const [s, rest] = getParams(depth, i);
      str = str + s;
      stack.push(...rest);
    }
  }
  return str;
}

function test(k, s, t){
  const l = g(k).substring(s, t);
  const r = f(k, s, t);
  console.log(g(k).length);
  //console.log(g(k))
  console.log(l);
  console.log(r);
  console.log(l == r);
}

test(1, 0, 3);
test(2, 2, 6);
test(2, 1, 5);
test(4, 44, 45);
test(5, 30, 40);
test(7, 100, 150);

python, power function using one loop

I tried to solve a problem: writing a power function that does the same job as the ** operator (in Python, for example).
After I solved it, I got another assignment:
I'm allowed to use only one loop and only one if/else.
I would love some insight.
I'm a beginner and have no clue how to go further.
My code was:
...
def power(x, y):
    s = x
    if y > 0:
        for i in range (1, y):
            s = s * x
    elif (y < 0):
        for i in range (y, -1):
            s = s * x
        s = 1 / s
    else:
        s = 1
    return s

print(power(3, 5))
print(power(3, -5))
print(power(3, 0))

Are you allowed to use the abs function?
from typing import Union

def power(x: Union[float, int], y: int) -> Union[float, int]:
    s: Union[float, int] = 1
    for _ in range(abs(y)):
        s *= x
    if y < 0:
        s = 1 / s
    return s

assert power(3, 5) == 243
assert 0.0040 < power(3, -5) < 0.0042
assert power(3, 0) == 1

A way I would do it is creating a function that accepts a number and an exponent.
Then I would create a list with exp amount of that number. Multiply everything in the list together to get the result:
def power(num, exp):
    prod = 1
    powers = [num] * exp
    for n in powers:
        prod *= n
    return prod

How to optimize str.replace() in Python

I am working on a binary string (i.e. it only contains 1 and 0) and I need to run a function N times. This function replaces every instance of '01' in the string with '10'. However, str.replace takes too much time to produce the output, especially since both the length of the string and N can be as big as 10^6.
I have tried implementing regex but it hasn't provided me with any optimization, instead taking more time to perform the task.
For example, if the string given to me is 01011 and N is equal to 1, then the output should be 10101. Similarly, if N becomes 2, the output becomes 11010 and so on.
Are there any optimizations of str.replace in python or is there any bit manipulation I could do to optimize my code?

Let's think of the input as bits forming an unsigned integer, possibly a very large one. For example:
1001 1011  # input number X
0100 1101  # Y = X>>1 = X//2 -- to get the previous bit to the same column
1001 0010  # Z1 = X & ~Y -- We are looking for 01, i.e. 1 after previous 0
0001 0010  # Z2 = Z1 with the highest bit cleared, because we don't want
           # to process the implicit 0 before the number
1010 1101  # output = X + Z2, this adds 1 where 01's are;
           # 1 + 01 = 10, this is what we want
Thus we can process the whole string with just a few arithmetic operations.
Update: sample code; I tried to address the comment about leading zeroes.
xstr = input("Enter binary number: ")
x = int(xstr, base=2)
digits = len(xstr)
mask = 2**(digits-1) - 1
print("{:0{width}b}".format(x,width=digits))
while True:
    z2 = x & ~(x >> 1) & mask
    if z2 == 0:
        print("final state")
        break
    x += z2
    print("{:0{width}b}".format(x,width=digits))
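To make the idea easier to test, the same arithmetic can be wrapped in a function that applies at most N steps and returns a string of the original width. This is only a sketch of the approach above; the name swap_01_steps is invented here, and the early exit relies on the observation that once no '01' pair remains, further steps change nothing.

```python
def swap_01_steps(bits: str, n: int) -> str:
    """Apply bits.replace('01', '10') n times using integer arithmetic."""
    width = len(bits)
    if width == 0:
        return bits
    x = int(bits, 2)
    # Clear the top bit position so the implicit 0 before the number is ignored.
    mask = (1 << (width - 1)) - 1
    for _ in range(n):
        # A set bit in z marks a 1 whose left neighbour is 0, i.e. a '01' pair.
        z = x & ~(x >> 1) & mask
        if z == 0:
            break  # no '01' left, the string is in its final state
        x += z     # adding 1 at that position turns '01' into '10'
    return format(x, "0{}b".format(width))

print(swap_01_steps("01011", 1))
print(swap_01_steps("01011", 2))
```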

While this is not an answer to the actual replacement question, my preliminary investigations show that the flipping rule will eventually arrange all the 1s at the beginning of the string and all the 0s at the end, so the following function will give the correct answer if N is close to len(s).
from collections import Counter

def asymptote(s, N):
    counts = Counter(s)
    return '1'*counts['1'] + '0'*counts['0']
I compared the results with
def brute(s, N):
    for i in range(N):
        s = s.replace('01', '10')
    return s
This graph shows where we have agreement between the brute force method and the asymptotic result for random strings
The yellow part is where the brute force and asymptotic result are the same. So you can see you need at least len(s)/2 flips to get to the asymptotic result most of the time and sometimes you need a bit more (the red line is 3*len(s)/4).

Here is the program I spoke of:
from typing import Dict
from itertools import product

table_1 = {
    "01": 1,
    "11": 0,
}
tables = {
    1: table_1
}

def _apply_table(s: str, n: int, table: Dict[str, int]) -> str:
    tl = n * 2
    out = ["0"] * len(s)
    for i in range(len(s)):
        if s[i] == '1':
            if i < tl:
                t = '1' * (tl - i - 1) + s[:i + 1]
            else:
                t = s[i - tl + 1:i + 1]
            o = table[t]
            out[i - o] = '1'
    return ''.join(out)

def _get_table(n: int) -> Dict[str, int]:
    if n not in tables:
        tables[n] = _generate_table(n)
    return tables[n]

def _generate_table(n: int) -> Dict[str, int]:
    def apply(t: str):
        return _apply_table(_apply_table(t, n - 1, _get_table(n - 1)), 1, table_1)
    tl = n * 2
    ts = (''.join(ps) + '1' for ps in product('01', repeat=tl - 1))
    return {t: len(apply(t).rpartition('1')[2]) for t in ts}

def transform(s: str, n: int):
    return _apply_table(s, n, _get_table(n))
This is not very fast, but transform has a time complexity of O(M), with M being the length of the string. However, the space complexity and the bad time complexity of the _generate_table function make it unusable :-/ (It may be possible to improve it, or to implement it in C for more speed. It also gets better if you store the hash tables rather than recomputing them every time.)

Calculating square root using only integer math in python

I'm working on a microcontroller that does not support floating point math. Integer math only. As such, there is no sqrt() function and I can't import any math modules. The MCU is running a subset of python that supports eight Python data types: None, integer, Boolean, string, function, tuple, byte list, and iterator. Also, the MCU can't do floor division (//).
My problem is that I need to calculate the magnitude of 3 signed integers.
mag = sqrt(x**2+y**2+z**2)
FWIW, the values can only be in the range of +/-1024 and I just need a close approximation. Does anyone have a pattern for solving this problem?

Note that the largest possible sum is 3*1024**2, so the largest possible square root is 1773 (floor - or 1774 rounded).
So you could simply take 0 as a starting guess, and repeatedly add 1 until the square exceeds the sum. That can't take more than about 1770 iterations.
Of course that's probably too slow. A straightforward binary search can cut that to 11 iterations, and doesn't require division (I'm assuming the MCU can shift right by 1 bit, which is the same as floor-division by 2).
EDIT
Here's some code, for a binary search returning the floor of the true square root:
def isqrt(n):
    if n <= 1:
        return n
    lo = 0
    hi = n >> 1
    while lo <= hi:
        mid = (lo + hi) >> 1
        sq = mid * mid
        if sq == n:
            return mid
        elif sq < n:
            lo = mid + 1
            result = mid
        else:
            hi = mid - 1
    return result
To check, run:
from math import sqrt
assert all(isqrt(i) == int(sqrt(i)) for i in range(3*1024**2 + 1))
That checks all possible inputs given what you said - and since binary search is notoriously tricky to get right in all cases, it's good to check every case! It doesn't take long on a "real" machine ;-)
PROBABLY IMPORTANT
To guard against possible overflow, and speed it significantly, change the initialization of lo and hi to this:
    hi = 1
    while hi * hi <= n:
        hi <<= 1
    lo = hi >> 1
Then the runtime becomes proportional to the number of bits in the result, greatly speeding smaller results. Indeed, for sloppy enough definitions of "close", you could stop right there.
FOR POSTERITY ;-)
Looks like the OP doesn't actually need square roots at all. But for someone who may, and can't afford division, here's a simplified version of the code, also removing multiplications from the initialization. Note: I'm not using .bit_length() because lots of deployed Python versions don't support that.
def isqrt(n):
    if n <= 1:
        return n
    hi, hisq = 2, 4
    while hisq <= n:
        hi <<= 1
        hisq <<= 2
    lo = hi >> 1
    while hi - lo > 1:
        mid = (lo + hi) >> 1
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid
    assert lo + 1 == hi
    assert lo**2 <= n < hi**2
    return lo

from math import sqrt
assert all(isqrt(i) == int(sqrt(i)) for i in range(3*1024**2 + 1))
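For the original use case, the magnitude of three signed integers, the shift-based isqrt can be applied directly to the sum of squares. The sketch below repeats that function so the block runs on its own and adds a hypothetical magnitude helper; it uses only integer multiplication, addition, comparison and shifts (no floats, no //), on the assumption that the MCU's Python subset supports those.

```python
def isqrt(n):
    """Floor of the square root of a non-negative integer."""
    if n <= 1:
        return n
    # Find a power of two hi with hi*hi > n, tracking its square by shifting.
    hi, hisq = 2, 4
    while hisq <= n:
        hi <<= 1
        hisq <<= 2
    lo = hi >> 1
    # Binary search between lo and hi.
    while hi - lo > 1:
        mid = (lo + hi) >> 1
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid
    return lo

def magnitude(x, y, z):
    """Integer approximation (floor) of sqrt(x**2 + y**2 + z**2)."""
    return isqrt(x * x + y * y + z * z)

print(magnitude(2, 3, 6))
print(magnitude(-1024, 1024, -1024))
```

The result is rounded down, so it can be up to 1 below the true magnitude, which should be within the "close approximation" the question asks for.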

There is an algorithm to calculate it, but it uses floor division. Without that, this is what comes to my mind:
def isqrt_linel(n):
    x = 0
    while (x+1)**2 <= n:
        x+=1
    return x
By the way, the algorithm that I know uses Newton's method:
def isqrt(n):
    #https://en.wikipedia.org/wiki/Integer_square_root
    #https://gist.github.com/bnlucas/5879594
    if n>=0:
        if n == 0:
            return 0
        a, b = divmod(n.bit_length(), 2)
        x = 2 ** (a + b)
        while True:
            y = (x + n // x) >> 1
            if y >= x:
                return x
            x = y
    else:
        raise ValueError("negative number")
