Amortized O(1) rolling minimum implemented in Python Numba/NumPy

Amortized O(1) rolling minimum implemented in Python Numba/NumPy - python

I am trying to implement a rolling minimum that has an amortized O(1) get_min(). The amortized O(1) algorithm comes from the accepted answer in this post
Original function:
import pandas as pd
import numpy as np
from numba import njit, prange
def rolling_min_original(data, n):
return pd.Series(data).rolling(n).min().to_numpy()
My attempt to implement the amortized O(1) get_min() algorithm:(this function has decent performance for non-small n)
#njit
def rollin_min(data, n):
"""
brief explanations:
param: stk2: the stack2 in the algorithm, except here it only stores the min stack
param: stk2_top: it starts at n-1, and drops gradually until it hits -1 then it comes backup to n-1
if stk2_top= 0 in the current iteration(it will become -1 at the end):
that means stk2_top is pointing at the bottom element in stk2,
after it drops to -1 from 0, in the next iteration, stk2 will be reassigned to a new array data[i-n+1:i+1],
because we need to include the current index.
at each iteration:
if stk2_top <0: (i.e. we have 0 stuff in stk2(aka stk2_top <0)
- copy the past n items(including the current one) to stk2, so that stk2 has n items now
- pick the top min from stk2(stk2_top = n-1 momentarily)
- move down the pointer by 1 after the operation(n-1 becomes n-2)
else: (i.e. we have j(1<=j<= n-1) stuff in stk2)
- pick the top min from stk2(stk2_top is j-1 momentarily)
- move down the pointer by 1 after the operation(j-1 becomes j-2)
"""
if n >1:
def min_getter_rev(arr1):
arr = arr1[::-1]
result = np.empty(len(arr), dtype = arr1.dtype)
result[0]= local_min = arr[0]
for i in range(1,len(arr)):
if arr[i] < local_min:
local_min = arr[i]
result[i] = local_min
return result
result_min= np.empty(len(data), dtype= data.dtype)
for i in prange(n-1):
result_min[i] =np.nan
stk2 = min_getter_rev(data[:n])
stk2_top = n-2#it is n-2 because the loop starts at n(not n-1)which is the second non nan term
stk1_min = data[n-1]#stk1 needs to be the first item of the stk1
result_min[n-1]= min(stk1_min, stk2[-1])
for i in range(n,len(data)):
if stk2_top >= 0:
if data[i] < stk1_min:
stk1_min= min(data[i], stk1_min)#the stk1 min
result_min[i] = min(stk1_min, stk2[stk2_top])#min of the top element in stk2 and the current element
else:
stk2 = min_getter_rev(data[i-n+1:i+1])
stk2_top= n-1
stk1_min = data[i]
result_min[i]= min(stk1_min, stk2[n-1])
stk2_top -= 1
return result_min
else:
return data
A naive implementation when n is small:
#njit(parallel= True)
def rolling_min_smalln(data, n):
result= np.empty(len(data), dtype= data.dtype)
for i in prange(n-1):
result[i]= np.nan
for i in prange(n-1, len(data)):
result[i]= data[i-n+1: i+1].min()
return result
Some little code for testing
def remove_nan(arr):
return arr[~np.isnan(arr)]
if __name__ == '__main__':
np.random.seed(0)
data_size = 200000
data = np.random.uniform(0,1000, size = data_size)+29000
w_size = 37
r_min_original= rolling_min_original(data, w_size)
rmin1 = rollin_min(data, w_size)
r_min_original = remove_nan(r_min_original)
rmin1 = remove_nan(rmin1)
print(np.array_equal(r_min_original,rmin1))
The function rollin_min() has nearly constant runtime and lower runtime than rolling_min_original() when n is large, which is nice. But it has poor performance when n is low(around n < 37 in my pc, in this range rollin_min() can easily be beaten by a naive implementation rolling_min_smalln()).
I am struggling to find ways to improve rollin_min(), but so far I am stuck, which is why I am seeking for help here.
My questions are the following:
Is the algorithm I am implementing the best out there for rolling/sliding window min/max?
If not, what is the best/better algorithm? If so, how can I further improve the function from the algorithm's point of view?
Besides the algorithm itself, what other ways can further improve the performance of the function rollin_min()?
EDIT: Moved my latest answer to the answer section upon multiple requests

The primary cause of slowness in your code is probably the allocation of a new array in min_getter_rev. You should reuse the same storage throughout.
Then, because you don't really have to implement a queue, you can make more optimizations. For example the size of the two stacks is at most (and usually) n, so you you can keep them in the same array with size n. Grow one from the start and one from the end.
You would notice that there is a very regular pattern - fill the array from start to end in order, recalculate the minimums from the end, generate output as you refill the array, repeat...
This leads to an actually simpler algorithm with a simpler explanation that doesn't refer to stacks at all. Here is an implementation, with comments about how it works. Note that I didn't bother stuffing the start with NaNs:
def rollin_min(data, n):
#allocate the result. Note the number valid windows is len(data)-(n-1)
result = np.empty(len(data)-(n-1), data.dtype)
#every nth position is a "mark"
#every window therefore contains exactly 1 mark
#the minimum in the window is the minimum of:
# the minimum from the window start to the following mark; and
# the minimum from the window end the the preceding (same) mark
#calculate the minimum from every window start index to the next mark
for mark in range(n-1, len(data), n):
v = data[mark]
if (mark < len(result)):
result[mark] = v
for i in range(mark-1, mark-n, -1):
v = min(data[i],v)
if (i < len(result)):
result[i] = v
#for each window, calculate the running total from the preceding mark
# to its end. The first window ends at the first mark
#then combine it with the first distance to get the window minimum
nextMarkPos = 0
for i in range(0,len(result)):
if i == nextMarkPos:
v = data[i+n-1]
nextMarkPos += n
else:
v = min(data[i+n-1],v)
result[i] = min(result[i],v)
return result

Moved this from the Question EDIT section to here upon multiple requests.
Inspired by the simpler implementation given by Matt Timmermans in the answer, I have made a cpu-multicore version of the rolling min. The code is as follows:
#njit(parallel= True)
def rollin_min2(data, n):
"""
1) make a loop that iterates over K sections of n elements; each section is independent so that it can benefit from multicores cpu
2) for each iteration of sections, generate backward local minimum(sec_min2) and forward minimum(sec_min1)
say m=8, len(data)= 23, then we only need the idx= (reversed to 7,6,5,...,1,0 (0 means minimum up until idx= 0)),
1st iter
result[7]= min_until 0,
result[8]= min(min(data[7:9]) and min_until 1),
result[9]= min(min(data[7:10]) and m_til 2)
...
result[14]= min(min(data[7:15]) and m_til 7)
2nd iter
result[15]= min_until 8,
result[16]= min(min(data[15:17]) and m_til 9),
result[17]= min(min(data[15:18]) and m_til 10)
...
result[22]= min(min(data[15:23]) and m_til 15)
"""
ar_len= len(data)
sec_min1= np.empty(ar_len, dtype = data.dtype)
sec_min2= np.empty(ar_len, dtype = data.dtype)
for i in prange(n-1):
sec_min1[i]= np.nan
for sec in prange(ar_len//n):
s2_min= data[n*sec+ n-1]
s1_min= data[n*sec+ n]
for i in range(n-1,-1,-1):
if data[n*sec+i] < s2_min:
s2_min= data[n*sec+i]
sec_min2[n*sec+i]= s2_min
sec_min1[n*sec+ n-1]= sec_min2[n*sec]
for i in range(n-1):
if n*sec+n+i < ar_len:
if data[n*sec+n+i] < s1_min:
s1_min= data[n*sec+n+i]
sec_min1[n*sec+n+i]= min(s1_min, sec_min2[n*sec+i+1])
else:
break
return sec_min1
I have actually spent an hour testing various implementations of rolling min. In my 6C/12T laptop, this multi-core version works best when n is "medium size". When n is at least 30% of the length of the source data though, other implementation starts to outshine. There must be even better ways to improve this function, but at the time of this edit I am not aware of them just yet.

Related

Unique ordered ratio of integers

I have two ordered lists of consecutive integers m=0, 1, ... M and n=0, 1, 2, ... N. Each value of m has a probability pm, and each value of n has a probability pn. I am trying to find the ordered list of unique values r=n/m and their probabilities pr. I am aware that r is infinite if n=0 and can even be undefined if m=n=0.
In practice, I would like to run for M and N each be of the order of 2E4, meaning up to 4E8 values of r - which would mean 3 GB of floats (assuming 8 Bytes/float).
For this calculation, I have written the python code below.
The idea is to iterate over m and n, and for each new m/n, insert it in the right place with its probability if it isn't there yet, otherwise add its probability to the existing number. My assumption is that it is easier to sort things on the way instead of waiting until the end.
The cases related to 0 are added at the end of the loop.
I am using the Fraction class since we are dealing with fractions.
The code also tracks the multiplicity of each unique value of m/n.
I have tested up to M=N=100, and things are quite slow. Are there better approaches to the question, or more efficient ways to tackle the code?
Timing:
M=N=30: 1 s
M=N=50: 6 s
M=N=80: 30 s
M=N=100: 82 s
import numpy as np
from fractions import Fraction
import time # For timiing
start_time = time.time() # Timing
M, N = 6, 4
mList, nList = np.arange(1, M+1), np.arange(1, N+1) # From 1 to M inclusive, deal with 0 later
mProbList, nProbList = [1/(M+1)]*(M), [1/(N+1)]*(N) # Probabilities, here assumed equal (not general case)
# Deal with mn=0 later
pmZero, pnZero = 1/(M+1), 1/(N+1) # P(m=0) and P(n=0)
pNaN = pmZero * pnZero # P(0/0) = P(m=0)P(n=0)
pZero = pmZero * (1 - pnZero) # P(0) = P(m=0)P(n!=0)
pInf = pnZero * (1 - pmZero) # P(inf) = P(m!=0)P(n=0)
# Main list of r=m/n, P(r) and mult(r)
# Start with first line, m=1
rList = [Fraction(mList[0], n) for n in nList[::-1]] # Smallest first
rProbList = [mProbList[0] * nP for nP in nProbList[::-1]] # Start with first line
rMultList = [1] * len(rList) # Multiplicity of each element
# Main loop
for m, mP in zip(mList[1:], mProbList[1:]):
for n, nP in zip(nList[::-1], nProbList[::-1]): # Pick an n value
r, rP, rMult = Fraction(m, n), mP*nP, 1
for i in range(len(rList)-1): # See where it fits in existing list
if r < rList[i]:
rList.insert(i, r)
rProbList.insert(i, rP)
rMultList.insert(i, 1)
break
elif r == rList[i]:
rProbList[i] += rP
rMultList[i] += 1
break
elif r < rList[i+1]:
rList.insert(i+1, r)
rProbList.insert(i+1, rP)
rMultList.insert(i+1, 1)
break
elif r == rList[i+1]:
rProbList[i+1] += rP
rMultList[i+1] += 1
break
if r > rList[-1]:
rList.append(r)
rProbList.append(rP)
rMultList.append(1)
break
# Deal with 0
rList.insert(0, Fraction(0, 1))
rProbList.insert(0, pZero)
rMultList.insert(0, N)
# Deal with infty
rList.append(np.Inf)
rProbList.append(pInf)
rMultList.append(M)
# Deal with undefined case
rList.append(np.NAN)
rProbList.append(pNaN)
rMultList.append(1)
print(".... done in %s seconds." % round(time.time() - start_time, 2))
print("************** Final list\nr", 'Prob', 'Mult')
for r, rP, rM in zip(rList, rProbList, rMultList): print(r, rP, rM)
print("************** Checks")
print("mList", mList, 'nList', nList)
print("Sum of proba = ", np.sum(rProbList))
print("Sum of multi = ", np.sum(rMultList), "\t(M+1)*(N+1) = ", (M+1)*(N+1))

Based on the suggestion of #Prune, and on this thread about merging lists of tuples, I have modified the code as below. It's a lot easier to read, and runs about an order of magnitude faster for N=M=80 (I have omitted dealing with 0 - would be done same way as in original post). I assume there may be ways to tweak the merge and conversion back to lists further yet.
# Do calculations
data = [(Fraction(m, n), mProb(m) * nProb(n)) for n in range(1, N+1) for m in range(1, M+1)]
data.sort()
# Merge duplicates using a dictionary
d = {}
for r, p in data:
if not (r in d): d[r] = [0, 0]
d[r][0] += p
d[r][1] += 1
# Convert back to lists
rList, rProbList, rMultList = [], [], []
for k in d:
rList.append(k)
rProbList.append(d[k][0])
rMultList.append(d[k][1])

I expect that "things are quite slow" because you've chosen a known inefficient sort. A single list insertion is O(K) (later list elements have to be bumped over, and there is added storage allocation on a regular basis). Thus a full-list insertion sort is O(K^2). For your notation, that is O((M*N)^2).
If you want any sort of reasonable performance, research and use the best-know methods. The most straightforward way to do this is to make your non-exception results as a simple list comprehension, and use the built-in sort for your penultimate list. Simply append your n=0 cases, and you're done in O(K log K) time.
I the expression below, I've assumed functions for m and n probabilities.
This is a notational convenience; you know how to directly compute them, and can substitute those expressions if you wish.
data = [ (mProb(m) * nProb(n), Fraction(m, n))
for n in range(1, N+1)
for m in range(0, M+1) ]
data.sort()
data.extend([ # generate your "zero" cases here ])

Iteration performance

I made a function to evaluate the following problem experimentally, taken from a A Primer for the Mathematics of Financial Engineering.
Problem: Let X be the number of times you must flip a fair coin until it lands heads. What are E[X] (expected value) and var(X) (variance)?
Following the textbook solution, the following code yields the correct answer:
from sympy import *
k = symbols('k')
Expected_Value = summation(k/2**k, (k, 1, oo)) # Both solutions work
Variance = summation(k**2/2**k, (k, 1, oo)) - Expected_Value**2
To validate this answer, I decided to have a go at making a function to simulate this experiment. The following code is what I came up with.
def coin_toss(toss, probability=[0.5, 0.5]):
"""Computes expected value and variance for coin toss experiment"""
flips = [] # Collects total number of throws until heads appear per experiment.
for _ in range(toss): # Simulate n flips
number_flips=[] # Number of flips until heads is tossed
while sum(number_flips) == 0: # Continue simulation while Tails are thrown
number_flips.append(np.random.choice(2, p=probability)) # Append result to number_flips
flips.append(len(number_flips)) #Append number of flips until lands heads to flips
Expected_Value, Variance = np.mean(flips), np.var(flips)
print('E[X]: {}'.format(Expected_Value),
'\nvar[X]: {}'.format(Variance)) # Return expected value
The run time if I simulate 1e6 experiments, using the following code is approximately 35.9 seconds.
from timeit import Timer
t1 = Timer("""coin_toss(1000000)""", """from __main__ import coin_toss""")
print(t1.timeit(1))
In the interest of developing my understanding of Python, is this a particularly efficient/pythonic way of approaching a problem like this? How can I utilise existing libraries to improve efficiency/flow execution?

In order to code in an efficient and pythonic way, you must take a look at PythonSpeed and NumPy. One exemple of a faster code using numpy can be found below.
The abc of optimizing in python+numpy is to vectorize operations, which in this case is quite dificult because there is a while that could actually be infinite, the coin can be flipped tails 40 times in a row. However, instead of doing a for with toss iterations, the work can be done in chunks. That is the main difference between coin_toss from the question and this coin_toss_2d approach.
coin_toss_2d
The main advantatge of coin_toss_2d is working by chunks, the size of these chunks has some default values, but they can be modified (as they will affect speed). Thus, it will only iterate over the while current_toss<toss a number of times toss%flips_at_a_time. This is achieved with numpy, which allows to generate a matrix with the results of repeating flips_at_a_time times the experiment of flipping a coin flips_per_try times. This matrix will contain 0 (tails) and 1 (heads).
# i.e. doing only 5 repetitions with 3 flips_at_a_time
flip_events = np.random.choice([0,1],size=(repetitions_at_a_time,flips_per_try),p=probability)
# Out
[[0 0 0] # still no head, we will have to keep trying
[0 1 1] # head at the 2nd try (position 1 in python)
[1 0 0]
[1 0 1]
[0 0 1]]
Once this result is obtained, argmax is called. This finds the index corresponding to the maximum (which will be 1, a head) of each row (repetition) and in case of multiple occurences, returns the first one, which is exactly what is needed, the first head after a sequence of tails.
maxs = flip_events.argmax(axis=1)
# Out
[0 1 0 0 2]
# The first position is 0, however, flip_events[0,0]!=1, it's not a head!
However, the case where all the row is 0 must be considered. In this case, the maximum will be 0, and its first occurence will also be 0, the first column (try). Therefore, we check that all the maximums found at the first try correspond to a head at the first try.
not_finished = (maxs==0) & (flip_events[:,0]!=1)
# Out
[ True False False False False] # first repetition is not finished
If that is not the case, we loop repeating that same process but only for the repetitions where there was no head in any of the tries.
n = np.sum(not_finished)
while n!=0: # while there are sequences without any head
flip_events = np.random.choice([0,1],size=(n,flips_per_try),p=probability) # number of experiments reduced to n (the number of all tails sequences)
maxs2 = flip_events.argmax(axis=1)
maxs[not_finished] += maxs2+flips_per_try # take into account that there have been flips_per_try tries already (every iteration is added)
not_finished2 = (maxs2==0) & (flip_events[:,0]!=1)
not_finished[not_finished] = not_finished2
n = np.sum(not_finished)
# Out
# flip_events
[[1 0 1]] # Now there is a head
# maxs2
[0]
# maxs
[3 1 0 0 2] # The value of the still unfinished repetition has been updated,
# taking into account that the first position in flip_events is the 4th,
# without affecting the rest
Then the indexes corresponding to the first head occurence are stored (we have to add 1 because python indexing starts at zero instead of 1). There is one try ... except ... block to cope with cases where toss is not a multiple of repetitions_at_a_time.
def coin_toss_2d(toss, probability=[.5,.5],repetitions_at_a_time=10**5,flips_per_try=20):
# Initialize and preallocate data
current_toss = 0
flips = np.empty(toss)
# loop by chunks
while current_toss<toss:
# repeat repetitions_at_a_time times experiment "flip coin flips_per_try times"
flip_events = np.random.choice([0,1],size=(repetitions_at_a_time,flips_per_try),p=probability)
# store first head ocurrence
maxs = flip_events.argmax(axis=1)
# Check for all tails sequences, that is, repetitions were we have to keep trying to get a head
not_finished = (maxs==0) & (flip_events[:,0]!=1)
n = np.sum(not_finished)
while n!=0: # while there are sequences without any head
flip_events = np.random.choice([0,1],size=(n,flips_per_try),p=probability) # number of experiments reduced to n (the number of all tails sequences)
maxs2 = flip_events.argmax(axis=1)
maxs[not_finished] += maxs2+flips_per_try # take into account that there have been flips_per_try tries already (every iteration is added)
not_finished2 = (maxs2==0) & (flip_events[:,0]!=1)
not_finished[not_finished] = not_finished2
n = np.sum(not_finished)
# try except in case toss is not multiple of repetitions_at_a_time, in general, no error is raised, that is why a try is useful
try:
flips[current_toss:current_toss+repetitions_at_a_time] = maxs+1
except ValueError:
flips[current_toss:] = maxs[:toss-current_toss]+1
# Update current_toss and move to the next chunk
current_toss += repetitions_at_a_time
# Once all values are obtained, average and return them
Expected_Value, Variance = np.mean(flips), np.var(flips)
return Expected_Value, Variance
coin_toss_map
Here the code is basically the same, but now, the intrinsec while is done in a separate function, which is called from the wrapper function coin_toss_map using map.
def toss_chunk(args):
probability,repetitions_at_a_time,flips_per_try = args
# repeat repetitions_at_a_time times experiment "flip coin flips_per_try times"
flip_events = np.random.choice([0,1],size=(repetitions_at_a_time,flips_per_try),p=probability)
# store first head ocurrence
maxs = flip_events.argmax(axis=1)
# Check for all tails sequences
not_finished = (maxs==0) & (flip_events[:,0]!=1)
n = np.sum(not_finished)
while n!=0: # while there are sequences without any head
flip_events = np.random.choice([0,1],size=(n,flips_per_try),p=probability) # number of experiments reduced to n (the number of all tails sequences)
maxs2 = flip_events.argmax(axis=1)
maxs[not_finished] += maxs2+flips_per_try # take into account that there have been flips_per_try tries already (every iteration is added)
not_finished2 = (maxs2==0) & (flip_events[:,0]!=1)
not_finished[not_finished] = not_finished2
n = np.sum(not_finished)
return maxs+1
def coin_toss_map(toss,probability=[.5,.5],repetitions_at_a_time=10**5,flips_per_try=20):
n_chunks, remainder = divmod(toss,repetitions_at_a_time)
args = [(probability,repetitions_at_a_time,flips_per_try) for _ in range(n_chunks)]
if remainder:
args.append((probability,remainder,flips_per_try))
flips = np.concatenate(map(toss_chunk,args))
# Once all values are obtained, average and return them
Expected_Value, Variance = np.mean(flips), np.var(flips)
return Expected_Value, Variance
Performance comparison
In my computer, I got the following computation time:
In [1]: %timeit coin_toss(10**6)
# Out
# ('E[X]: 2.000287', '\nvar[X]: 1.99791891763')
# ('E[X]: 2.000459', '\nvar[X]: 2.00692478932')
# ('E[X]: 1.998118', '\nvar[X]: 1.98881045808')
# ('E[X]: 1.9987', '\nvar[X]: 1.99508631')
# 1 loop, best of 3: 46.2 s per loop
In [2]: %timeit coin_toss_2d(10**6,repetitions_at_a_time=5*10**5,flips_per_try=4)
# Out
# 1 loop, best of 3: 197 ms per loop
In [3]: %timeit coin_toss_map(10**6,repetitions_at_a_time=4*10**5,flips_per_try=4)
# Out
# 1 loop, best of 3: 192 ms per loop
And the results for the mean and variance are:
In [4]: [coin_toss_2d(10**6,repetitions_at_a_time=10**5,flips_per_try=10) for _ in range(4)]
# Out
# [(1.999848, 1.9990739768960009),
# (2.000654, 2.0046035722839997),
# (1.999835, 2.0072329727749993),
# (1.999277, 2.001566477271)]
In [4]: [coin_toss_map(10**6,repetitions_at_a_time=10**5,flips_per_try=4) for _ in range(4)]
# Out
# [(1.999552, 2.0005057992959996),
# (2.001733, 2.011159996711001),
# (2.002308, 2.012128673136001),
# (2.000738, 2.003613455356)]

Why is insertion sort running O(n^2) on a sorted list?

I'm doing algorithm analysis in Python and in short I ran into what I suspect is a problem. I need to analyze the run time of insertion sort on sorted arrays so I have the following code
def insertionSort(arr1):
t1 = time.time();
insertion_sort.sort(arr1)
t2 = time.time();
print (len(arr1));
print (str(t2-t1));
...
print ('Insertion Sort Beginning');
for x in range(0,15):
#Creates arrays of various sizes with integers from 0-100 randomly sprawled about. They are named after the power of 10 the size is
toSort1 = [randint(0,100)] * randint(100000,10000000);
#Presorts the array, insertion sort is faster for small inputs
sorted_list = insertion_sort.sort(toSort1);
##################################
insertionSort(sorted_list);
The problem is the output of this is O(n^2)! I'm new to Python so I figure this is probably a semantics error I didn't catch. insertion_sort should be assumed to be correct, but can be reviewed here. This could only be the case if it was sorted in reverse order when it was timed but it was literally passed to the same sorter twice. How could this be?
This is a graphical representation of the output

I got linear time results with your program.
I added implementation of insertion-sort and little modification of the code as below for this tests.
from random import randint
import time
def insertion_sort(arr):
# Traverse through 1 to len(arr)
for i in range(1, len(arr)):
key = arr[i]
# Move elements of arr[0..i-1], that are
# greater than key, to one position ahead
# of their current position
j = i-1
while j >=0 and key < arr[j] :
arr[j+1] = arr[j]
j -= 1
arr[j+1] = key
def insertionSort(arr1):
t1 = time.time();
insertion_sort(arr1)
t2 = time.time();
print str(len(arr1)) + ", " + str(t2-t1)
print ('Insertion Sort Beginning');
for x in range(0,15):
#Creates arrays of various sizes with integers from 0-100 randomly sprawled about. They are named after the power of 10 the size is
toSort1 = [randint(0,100)] * randint(100000,10000000);
#Presorts the array, insertion sort is faster for small inputs
sorted_list = sorted(toSort1);
##################################
insertionSort(sorted_list);
Hope it helps!

Given a string of a million numbers, return all repeating 3 digit numbers

I had an interview with a hedge fund company in New York a few months ago and unfortunately, I did not get the internship offer as a data/software engineer. (They also asked the solution to be in Python.)
I pretty much screwed up on the first interview problem...
Question: Given a string of a million numbers (Pi for example), write
a function/program that returns all repeating 3 digit numbers and number of
repetition greater than 1
For example: if the string was: 123412345123456 then the function/program would return:
123 - 3 times
234 - 3 times
345 - 2 times
They did not give me the solution after I failed the interview, but they did tell me that the time complexity for the solution was constant of 1000 since all the possible outcomes are between:
000 --> 999
Now that I'm thinking about it, I don't think it's possible to come up with a constant time algorithm. Is it?

You got off lightly, you probably don't want to be working for a hedge fund where the quants don't understand basic algorithms :-)
There is no way to process an arbitrarily-sized data structure in O(1) if, as in this case, you need to visit every element at least once. The best you can hope for is O(n) in this case, where n is the length of the string.
Although, as an aside, a nominal O(n) algorithm will be O(1) for a fixed input size so, technically, they may have been correct here. However, that's not usually how people use complexity analysis.
It appears to me you could have impressed them in a number of ways.
First, by informing them that it's not possible to do it in O(1), unless you use the "suspect" reasoning given above.
Second, by showing your elite skills by providing Pythonic code such as:
inpStr = '123412345123456'
# O(1) array creation.
freq = [0] * 1000
# O(n) string processing.
for val in [int(inpStr[pos:pos+3]) for pos in range(len(inpStr) - 2)]:
freq[val] += 1
# O(1) output of relevant array values.
print ([(num, freq[num]) for num in range(1000) if freq[num] > 1])
This outputs:
[(123, 3), (234, 3), (345, 2)]
though you could, of course, modify the output format to anything you desire.
And, finally, by telling them there's almost certainly no problem with an O(n) solution, since the code above delivers results for a one-million-digit string in well under half a second. It seems to scale quite linearly as well, since a 10,000,000-character string takes 3.5 seconds and a 100,000,000-character one takes 36 seconds.
And, if they need better than that, there are ways to parallelise this sort of stuff that can greatly speed it up.
Not within a single Python interpreter of course, due to the GIL, but you could split the string into something like (overlap indicated by vv is required to allow proper processing of the boundary areas):
vv
123412 vv
123451
5123456
You can farm these out to separate workers and combine the results afterwards.
The splitting of input and combining of output are likely to swamp any saving with small strings (and possibly even million-digit strings) but, for much larger data sets, it may well make a difference. My usual mantra of "measure, don't guess" applies here, of course.
This mantra also applies to other possibilities, such as bypassing Python altogether and using a different language which may be faster.
For example, the following C code, running on the same hardware as the earlier Python code, handles a hundred million digits in 0.6 seconds, roughly the same amount of time as the Python code processed one million. In other words, much faster:
#include <stdio.h>
#include <string.h>
int main(void) {
static char inpStr[100000000+1];
static int freq[1000];
// Set up test data.
memset(inpStr, '1', sizeof(inpStr));
inpStr[sizeof(inpStr)-1] = '\0';
// Need at least three digits to do anything useful.
if (strlen(inpStr) <= 2) return 0;
// Get initial feed from first two digits, process others.
int val = (inpStr[0] - '0') * 10 + inpStr[1] - '0';
char *inpPtr = &(inpStr[2]);
while (*inpPtr != '\0') {
// Remove hundreds, add next digit as units, adjust table.
val = (val % 100) * 10 + *inpPtr++ - '0';
freq[val]++;
}
// Output (relevant part of) table.
for (int i = 0; i < 1000; ++i)
if (freq[i] > 1)
printf("%3d -> %d\n", i, freq[i]);
return 0;
}

Constant time isn't possible. All 1 million digits need to be looked at at least once, so that is a time complexity of O(n), where n = 1 million in this case.
For a simple O(n) solution, create an array of size 1000 that represents the number of occurrences of each possible 3 digit number. Advance 1 digit at a time, first index == 0, last index == 999997, and increment array[3 digit number] to create a histogram (count of occurrences for each possible 3 digit number). Then output the content of the array with counts > 1.

A million is small for the answer I give below. Expecting only that you have to be able to run the solution in the interview, without a pause, then The following works in less than two seconds and gives the required result:
from collections import Counter
def triple_counter(s):
c = Counter(s[n-3: n] for n in range(3, len(s)))
for tri, n in c.most_common():
if n > 1:
print('%s - %i times.' % (tri, n))
else:
break
if __name__ == '__main__':
import random
s = ''.join(random.choice('0123456789') for _ in range(1_000_000))
triple_counter(s)
Hopefully the interviewer would be looking for use of the standard libraries collections.Counter class.
Parallel execution version
I wrote a blog post on this with more explanation.

The simple O(n) solution would be to count each 3-digit number:
for nr in range(1000):
cnt = text.count('%03d' % nr)
if cnt > 1:
print '%03d is found %d times' % (nr, cnt)
This would search through all 1 million digits 1000 times.
Traversing the digits only once:
counts = [0] * 1000
for idx in range(len(text)-2):
counts[int(text[idx:idx+3])] += 1
for nr, cnt in enumerate(counts):
if cnt > 1:
print '%03d is found %d times' % (nr, cnt)
Timing shows that iterating only once over the index is twice as fast as using count.

Here is a NumPy implementation of the "consensus" O(n) algorithm: walk through all triplets and bin as you go. The binning is done by upon encountering say "385", adding one to bin[3, 8, 5] which is an O(1) operation. Bins are arranged in a 10x10x10 cube. As the binning is fully vectorized there is no loop in the code.
def setup_data(n):
import random
digits = "0123456789"
return dict(text = ''.join(random.choice(digits) for i in range(n)))
def f_np(text):
# Get the data into NumPy
import numpy as np
a = np.frombuffer(bytes(text, 'utf8'), dtype=np.uint8) - ord('0')
# Rolling triplets
a3 = np.lib.stride_tricks.as_strided(a, (3, a.size-2), 2*a.strides)
bins = np.zeros((10, 10, 10), dtype=int)
# Next line performs O(n) binning
np.add.at(bins, tuple(a3), 1)
# Filtering is left as an exercise
return bins.ravel()
def f_py(text):
counts = [0] * 1000
for idx in range(len(text)-2):
counts[int(text[idx:idx+3])] += 1
return counts
import numpy as np
import types
from timeit import timeit
for n in (10, 1000, 1000000):
data = setup_data(n)
ref = f_np(**data)
print(f'n = {n}')
for name, func in list(globals().items()):
if not name.startswith('f_') or not isinstance(func, types.FunctionType):
continue
try:
assert np.all(ref == func(**data))
print("{:16s}{:16.8f} ms".format(name[2:], timeit(
'f(**data)', globals={'f':func, 'data':data}, number=10)*100))
except:
print("{:16s} apparently crashed".format(name[2:]))
Unsurprisingly, NumPy is a bit faster than #Daniel's pure Python solution on large data sets. Sample output:
# n = 10
# np 0.03481400 ms
# py 0.00669330 ms
# n = 1000
# np 0.11215360 ms
# py 0.34836530 ms
# n = 1000000
# np 82.46765980 ms
# py 360.51235450 ms

I would solve the problem as follows:
def find_numbers(str_num):
final_dict = {}
buffer = {}
for idx in range(len(str_num) - 3):
num = int(str_num[idx:idx + 3])
if num not in buffer:
buffer[num] = 0
buffer[num] += 1
if buffer[num] > 1:
final_dict[num] = buffer[num]
return final_dict
Applied to your example string, this yields:
>>> find_numbers("123412345123456")
{345: 2, 234: 3, 123: 3}
This solution runs in O(n) for n being the length of the provided string, and is, I guess, the best you can get.

As per my understanding, you cannot have the solution in a constant time. It will take at least one pass over the million digit number (assuming its a string). You can have a 3-digit rolling iteration over the digits of the million length number and increase the value of hash key by 1 if it already exists or create a new hash key (initialized by value 1) if it doesn't exists already in the dictionary.
The code will look something like this:
def calc_repeating_digits(number):
hash = {}
for i in range(len(str(number))-2):
current_three_digits = number[i:i+3]
if current_three_digits in hash.keys():
hash[current_three_digits] += 1
else:
hash[current_three_digits] = 1
return hash
You can filter down to the keys which have item value greater than 1.

As mentioned in another answer, you cannot do this algorithm in constant time, because you must look at at least n digits. Linear time is the fastest you can get.
However, the algorithm can be done in O(1) space. You only need to store the counts of each 3 digit number, so you need an array of 1000 entries. You can then stream the number in.
My guess is that either the interviewer misspoke when they gave you the solution, or you misheard "constant time" when they said "constant space."

Here's my answer:
from timeit import timeit
from collections import Counter
import types
import random
def setup_data(n):
digits = "0123456789"
return dict(text = ''.join(random.choice(digits) for i in range(n)))
def f_counter(text):
c = Counter()
for i in range(len(text)-2):
ss = text[i:i+3]
c.update([ss])
return (i for i in c.items() if i[1] > 1)
def f_dict(text):
d = {}
for i in range(len(text)-2):
ss = text[i:i+3]
if ss not in d:
d[ss] = 0
d[ss] += 1
return ((i, d[i]) for i in d if d[i] > 1)
def f_array(text):
a = [[[0 for _ in range(10)] for _ in range(10)] for _ in range(10)]
for n in range(len(text)-2):
i, j, k = (int(ss) for ss in text[n:n+3])
a[i][j][k] += 1
for i, b in enumerate(a):
for j, c in enumerate(b):
for k, d in enumerate(c):
if d > 1: yield (f'{i}{j}{k}', d)
for n in (1E1, 1E3, 1E6):
n = int(n)
data = setup_data(n)
print(f'n = {n}')
results = {}
for name, func in list(globals().items()):
if not name.startswith('f_') or not isinstance(func, types.FunctionType):
continue
print("{:16s}{:16.8f} ms".format(name[2:], timeit(
'results[name] = f(**data)', globals={'f':func, 'data':data, 'results':results, 'name':name}, number=10)*100))
for r in results:
print('{:10}: {}'.format(r, sorted(list(results[r]))[:5]))
The array lookup method is very fast (even faster than #paul-panzer's numpy method!). Of course, it cheats since it isn't technicailly finished after it completes, because it's returning a generator. It also doesn't have to check every iteration if the value already exists, which is likely to help a lot.
n = 10
counter 0.10595780 ms
dict 0.01070654 ms
array 0.00135370 ms
f_counter : []
f_dict : []
f_array : []
n = 1000
counter 2.89462101 ms
dict 0.40434612 ms
array 0.00073838 ms
f_counter : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_dict : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_array : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
n = 1000000
counter 2849.00500992 ms
dict 438.44007806 ms
array 0.00135370 ms
f_counter : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_dict : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_array : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]

Image as answer:
Looks like a sliding window.

Here is my solution:
from collections import defaultdict
string = "103264685134845354863"
d = defaultdict(int)
for elt in range(len(string)-2):
d[string[elt:elt+3]] += 1
d = {key: d[key] for key in d.keys() if d[key] > 1}
With a bit of creativity in for loop(and additional lookup list with True/False/None for example) you should be able to get rid of last line, as you only want to create keys in dict that we visited once up to that point.
Hope it helps :)

-Telling from the perspective of C.
-You can have an int 3-d array results[10][10][10];
-Go from 0th location to n-4th location, where n being the size of the string array.
-On each location, check the current, next and next's next.
-Increment the cntr as resutls[current][next][next's next]++;
-Print the values of
results[1][2][3]
results[2][3][4]
results[3][4][5]
results[4][5][6]
results[5][6][7]
results[6][7][8]
results[7][8][9]
-It is O(n) time, there is no comparisons involved.
-You can run some parallel stuff here by partitioning the array and calculating the matches around the partitions.

inputStr = '123456123138276237284287434628736482376487234682734682736487263482736487236482634'
count = {}
for i in range(len(inputStr) - 2):
subNum = int(inputStr[i:i+3])
if subNum not in count:
count[subNum] = 1
else:
count[subNum] += 1
print count

Binary mask with shift operation without cycle

We have some large binary number N (large means millions of digits). We also have binary mask M where 1 means that we must remove digit in this position in number N and move all higher bits one position right.
Example:
N = 100011101110
M = 000010001000
Res 1000110110
Is it possible to solve this problem without cycle with some set of logical or arithmetical operations? We can assume that we have access to bignum arithmetic in Python.
Feels like it should be something like this:
Res = N - (N xor M)
But it doesn't work
UPD: My current solution with cycle is following:
def prepare_reduced_arrays(dict_of_N, mask):
'''
mask: string '0000011000'
each element of dict_of_N - big python integer
'''
capacity = len(mask)
answer = dict()
for el in dict_of_N:
answer[el] = 0
new_capacity = 0
for i in range(capacity - 1, -1, -1):
if mask[i] == '1':
continue
cap2 = (1 << new_capacity)
pos = (capacity - i - 1)
for el in dict_of_N:
current_bit = (dict_of_N[el] >> pos) & 1
if current_bit:
answer[el] |= cap2
new_capacity += 1
return answer, new_capacity

While this may not be possible without a loop in python, it can be made extremely fast with numba and just in time compilation. I went on the assumption that your inputs could be easily represented as boolean arrays, which would be very simple to construct from a binary file using struct. The method I have implemented involves iterating a few different objects, however these iterations were chosen carefully to make sure they were compiler optimized, and never doing the same work twice. The first iteration is using np.where to locate the indices of all the bits to delete. This specific function (among many others) is optimized by the numba compiler. I then use this list of bit indices to build the slice indices for slices of bits to keep. The final loop copies these slices to an empty output array.
import numpy as np
from numba import jit
from time import time
def binary_mask(num, mask):
num_nbits = num.shape[0] #how many bits are in our big num
mask_bits = np.where(mask)[0] #which bits are we deleting
mask_n_bits = mask_bits.shape[0] #how many bits are we deleting
start = np.empty(mask_n_bits + 1, dtype=int) #preallocate array for slice start indexes
start[0] = 0 #first slice starts at 0
start[1:] = mask_bits + 1 #subsequent slices start 1 after each True bit in mask
end = np.empty(mask_n_bits + 1, dtype=int) #preallocate array for slice end indexes
end[:mask_n_bits] = mask_bits #each slice ends on (but does not include) True bits in the mask
end[mask_n_bits] = num_nbits + 1 #last slice goes all the way to the end
out = np.empty(num_nbits - mask_n_bits, dtype=np.uint8) #preallocate return array
for i in range(mask_n_bits + 1): #for each slice
a = start[i] #use local variables to reduce number of lookups
b = end[i]
c = a - i
d = b - i
out[c:d] = num[a:b] #copy slices
return out
jit_binary_mask = jit("b1[:](b1[:], b1[:])")(binary_mask) #decorator without syntax sugar
###################### Benchmark ########################
bignum = np.random.randint(0,2,1000000, dtype=bool) # 1 million random bits
bigmask = np.random.randint(0,10,1000000, dtype=np.uint8)==9 #delete about 1 in 10 bits
t = time()
for _ in range(10): #10 cycles of just numpy implementation
out = binary_mask(bignum, bigmask)
print(f"non-jit: {time()-t} seconds")
t = time()
out = jit_binary_mask(bignum, bigmask) #once ahead of time to compile
compile_and_run = time() - t
t = time()
for _ in range(10): #10 cycles of compiled numpy implementation
out = jit_binary_mask(bignum, bigmask)
jit_runtime = time()-t
print(f"jit: {jit_runtime} seconds")
print(f"estimated compile_time: {compile_and_run - jit_runtime/10}")
In this example, I execute the benchmark on a boolean array of length 1,000,000 a total of 10 times for both the compiled and un-compiled version. On my laptop, the output is:
non-jit: 1.865583896636963 seconds
jit: 0.06370806694030762 seconds
estimated compile_time: 0.1652850866317749
As you can see with a simple algorithm like this, very significant performance gains can be seen from compilation. (in my case about 20-30x speedup)

As far as I know, this can be done without the use of loops if and only if M is a power of 2.
Let's take your example, and modify M so that it is a power of 2:
N = 0b100011101110 = 2286
M = 0b000000001000 = 8
Removing the fourth lowest bit from N and shifting the higher bits to the right would result in:
N = 0b10001110110 = 1142
We achieved this using the following algorithm:
Begin with N = 0b100011101110 = 2286
Iterate from the most-significant bit to the least-significant bit in M.
If the current bit in M is set to 1, then store the lower bits in some variable, x:
x = 0b1101110
Then, subtract every bit up to and including the current bit in M from N, so that we end up with the following:
N - (0b10000000 + x) = N - (0b10000000 + 0b1101110) = 0b100011101110 - 0b11101110 = 0b100000000000
This step can also be achieved by and-ing the bits with 0, which may be more efficient.
Next, we shift the result once to the right:
0b100000000000 >> 1 = 0b10000000000
Finally, we add back x to the shifted result:
0b10000000000 + x = 0b10000000000 + 0b1101110 = 0b10001101110 = 1142
There may be a possibility that this can somehow be done without loops, but it would actually be efficient if you were to simply iterate over M (from the most-significant bit to the least-significant bit) and performed this process on every set bit, as the time complexity would be O(M.bit_length()).
I wrote up the code for this algorithm as well, and I believe it's relatively efficient, but I don't have any big binary numbers to test it with:
def remove_bits(N, M):
bit = 2 ** (M.bit_length() - 1)
while bit != 0:
if M & bit:
ones = bit - 1
# Store lower `bit` bits.
temp = N & ones
# Clear lower `bit` bits.
N &= ~ones
# Shift once to the right.
N >>= 1
# Set stored lower `bit` bits.
N |= temp
bit >>= 1
return N
if __name__ == '__main__':
N = 0b100011101110
M = 0b000010001000
print(bin(remove_bits(N, M)))
Using your example, this returns your result: 0b1000110110

I don't think there's any way to do this in a constant number of calls to the built-in bitwise operators. Python would have to provide something like PEXT for that to be possible.
For literally millions of digits, you may actually get best performance by working in terms of sequences of bits, sacrificing the space advantages of Python ints and the time advantages of bitwise operations in favor of more flexibility in the operations you can perform. I don't know where the break-even point would be:
import itertools
bits = bin(N)[2:]
maskbits = bin(M)[2:].zfill(len(bits))
bits = bits.zfill(len(maskbits))
chosenbits = itertools.compress(bits, map('0'.__eq__, maskbits))
result = int(''.join(chosenbits), 2)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Amortized O(1) rolling minimum implemented in Python Numba/NumPy - python

Related

Unique ordered ratio of integers

Iteration performance

Why is insertion sort running O(n^2) on a sorted list?

Given a string of a million numbers, return all repeating 3 digit numbers

Binary mask with shift operation without cycle

Categories

Resources