I'm working on a competitive programming problem where we're trying to find the median of two sorted arrays. The optimal algorithm is to perform a binary search and identify splitting points, i and j, between the two arrays.
I'm having trouble deriving the solution myself; I don't understand the initial logic. I'll walk through how I think about the problem so far.
The concept of the median is to partition the data into two halves. Consider a hypothetical left array and a hypothetical right array after merging the two given arrays, where both halves have the same length (assuming the total length is even).
We know that, given those two hypothetical arrays, the median works out to be [max(left) + min(right)] / 2. This makes sense so far. But the issue now is knowing how to construct the left and right arrays.
We can choose a splitting point on ArrayA as i and a splitting point on ArrayB as j. Note that we need len(ArrayA[:i]) + len(ArrayB[:j]) == len(ArrayA[i:]) + len(ArrayB[j:]).
Now we just need to find the cutting points. We could try all splitting points i, j that satisfy the median condition. However, this works out to be O(M*N), where M is the size of ArrayB and N is the size of ArrayA.
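To make that brute force concrete, here is a rough sketch of what I mean (the helper name and the odd/even handling at the end are just my guesses):

def median_bruteforce(A, B):
    # Try every pair of cut points (i, j); i elements of A and j elements of B go to the left half.
    n, m = len(A), len(B)
    half = (n + m + 1) // 2   # size of the left half (it gets the extra element if the total is odd)
    INF = float("inf")
    for i in range(n + 1):
        for j in range(m + 1):
            if i + j != half:
                continue
            left_a  = A[i - 1] if i > 0 else -INF
            left_b  = B[j - 1] if j > 0 else -INF
            right_a = A[i] if i < n else INF
            right_b = B[j] if j < m else INF
            # Valid cut: everything on the left is <= everything on the right.
            if left_a <= right_b and left_b <= right_a:
                if (n + m) % 2:
                    return max(left_a, left_b)
                return (max(left_a, left_b) + min(right_a, right_b)) / 2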
I'm not sure how to get from where I am to the binary search solution using my train of thought. If someone could give me pointers - that would be awesome.
Here is my approach that I managed to come up with.
First of all, we know that the resulting array will contain N+M elements, meaning that the left part will contain (N+M)/2 elements and the right part will contain (N+M)/2 elements as well (if N+M is odd, one part gets the extra element). Let's denote the resulting array as Ans, and denote the size of one of its parts as PartSize.
Perform a binary search operation on array A. The range of such binary search will be [0, N]. This binary search operation will help you determine the number of elements from array A that will form the left part of the resulting array.
Now, suppose we are testing the value i. If i elements from array A are supposed to be included in the left part of the resulting array, this means that j = PartSize - i elements must be included from array B in the left part as well. We have the following possibilities:
j > M: this is an invalid state. It means we still need to choose more elements from array A, so our new binary search range becomes [i + 1, N].
j <= M and A[i+1] < B[j]: this is a tricky case. Think about it: if the next element in array A is smaller than element j in array B, then A[i+1] is supposed to be in the left part rather than B[j]. In this case our new binary search range becomes [i+1, N].
j <= M and A[i] > B[j+1]: this is close to the previous case. If the next element in array B is smaller than element i in array A, then B[j+1] is supposed to be in the left part rather than A[i]. In this case our new binary search range becomes [0, i-1].
j <= M and A[i+1] >= B[j] and A[i] <= B[j+1]: this is the optimal case, and you have finally found your answer.
After the binary search operation is finished, and you managed to calculate both i and j, you can now easily find the value of the median. You need to handle a few cases here depending on whether N+M is odd or even.
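Here is a rough Python sketch of the search described above (0-indexed, so the comparisons use A[i-1] and A[i] rather than A[i] and A[i+1]; I also binary search over the shorter array, which removes the j > M case, and the names are my own):

def find_median_sorted_arrays(A, B):
    if len(A) > len(B):          # search the shorter array so j = part_size - i never exceeds its bounds
        A, B = B, A
    n, m = len(A), len(B)
    part_size = (n + m + 1) // 2 # size of the left part
    lo, hi = 0, n
    INF = float("inf")
    while lo <= hi:
        i = (lo + hi) // 2       # elements taken from A into the left part
        j = part_size - i        # elements taken from B into the left part
        left_a  = A[i - 1] if i > 0 else -INF
        right_a = A[i]     if i < n else  INF
        left_b  = B[j - 1] if j > 0 else -INF
        right_b = B[j]     if j < m else  INF
        if left_a > right_b:     # took too many from A
            hi = i - 1
        elif left_b > right_a:   # took too few from A
            lo = i + 1
        else:                    # valid partition found
            if (n + m) % 2:
                return max(left_a, left_b)
            return (max(left_a, left_b) + min(right_a, right_b)) / 2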
Hope it helps!
For example, suppose I had an (n,2) dimensional tensor t whose elements are all from the set S containing random integers. I want to build another tensor d with size (m,2) where individual elements in each tuple are from S, but the whole tuples do not occur in t.
E.g.
S = [0, 1, 2, 3, 7]
t = [[0, 1],
     [7, 3],
     [3, 1]]
d = some_algorithm(S, t)
# one possible result:
# d = [[2, 1],
#      [3, 2],
#      [7, 0]]
What is the most efficient way to do this in python? Preferably with pytorch or numpy, but I can work around general solutions.
In my naive attempt, I just use
d = np.random.choice(S,(m,2))
non_dupes = [i not in t for i in d]
d = d[non_dupes]
But both t and S are incredibly large, and this takes an enormous amount of time (not to mention, it rarely results in an (m,2) array). I feel like there has to be some fancy tensor thing I can do to achieve this, or maybe I could build a large hash map of the values in t so checking for membership in t is O(1), but that produces the same issue, just with memory. Is there a more efficient way?
An approximate solution is also okay.
my naive attempt would be a base-transformation function to reduce the problem to an integer set problem:
definitions and assumptions:
let S be a set (unique elements)
let L be the number of elements in S
let t be a set of M-tuples with elements from S
the original order of the elements in t is irrelevant
let I(x) be the index function of the element x in S
let x[n] be the n-th tuple-member of an element of t
let f(x) be our base-transform function (and f^-1 its inverse)
since S is a set, we can write each element in t as an M-digit number in base L, using the indices I(x) of its members as digits.
for M=2 the transformation looks like
f(x) = I(x[1])*L^1 + I(x[0])*L^0
f^-1(x) is also rather trivial: take x mod L to get back the index of the least significant digit, then floor(x/L) and repeat until all indices are extracted; look up the values in S and construct the tuple.
since you can now represent t as an integer set (read: hashtable), calculating the inverse set d becomes rather trivial
loop from 0 to L^M - 1 and ask your hashtable whether each number is in t; if it is not, it belongs to d
if the size of S is too big, you can also just draw random numbers and test them against the hashtable to obtain a subset of the inverse of t
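a small sketch of this idea in Python (the helper names are mine, and I assume the full code range 0 .. L^M - 1):

import numpy as np

S = [0, 1, 2, 3, 7]
L = len(S)
M = 2
index_of = {v: i for i, v in enumerate(S)}   # I(x): index of x in S

def f(tup):
    # encode an M-tuple as an M-digit number in base L
    return sum(index_of[tup[k]] * L**k for k in range(M))

def f_inv(code):
    # decode: repeatedly take mod L and divide to recover the indices
    out = []
    for _ in range(M):
        out.append(S[code % L])
        code //= L
    return tuple(out)

t = [(0, 1), (7, 3), (3, 1)]
t_codes = {f(x) for x in t}                  # t as an integer set (hashtable)

def sample_d(m, rng=np.random.default_rng()):
    # draw random codes and keep only those not present in t (and not drawn before)
    d, seen = [], set(t_codes)
    while len(d) < m:
        code = int(rng.integers(0, L**M))
        if code not in seen:
            seen.add(code)
            d.append(f_inv(code))
    return d

print(sample_d(3))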
does this help you?
If |t| + |d| << |S|^2 then the probability of some random tuple to be chosen again (in a single iteration) is relatively small.
To be more exact, if (|t|+|d|) / |S|^2 = C for some constant C<1, then if you redraw an element until it is a "new" one, the expected number of redraws needed is 1/(1-C).
This means, that by doing this, and redrawing elements until this is a new element, you get O((1/(1-C)) * |d|) times to process a new element (on average), which is O(|d|) if C is indeed constant.
Checking if an element is already "seen" can be done in several ways:
Keeping hash sets of t and d. This requires extra space, but each lookup is constant O(1) time. You could also use a bloom filter instead of storing the actual elements you have already seen; this will make some errors, saying an element is already "seen" though it was not, but never the other way around - so you will still get all elements in d as unique.
Sorting t in place and using binary search. This adds O(|t|log|t|) pre-processing and O(log|t|) per lookup, but requires no additional space (other than where you store d).
If, in fact, |d| + |t| is very close to |S|^2, then an O(|S|^2) time solution could be to use a Fisher-Yates shuffle on the available choices and take the first |d| elements that do not appear in t.
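A sketch of the first option, redrawing against a hash set (the function name and the conversion to plain Python ints are my own choices):

import numpy as np

def sample_new_tuples(S, t, m, rng=None):
    # Draw m tuples over S that do not occur in t, redrawing on collisions.
    # Expected O(m / (1 - C)) draws when (|t| + m) / |S|^2 = C < 1.
    rng = rng or np.random.default_rng()
    S = np.asarray(S)
    seen = {tuple(int(v) for v in row) for row in np.asarray(t)}  # hash set of "already seen" tuples
    out = []
    while len(out) < m:
        cand = tuple(int(v) for v in S[rng.integers(0, len(S), size=2)])
        if cand not in seen:          # O(1) membership test
            seen.add(cand)
            out.append(cand)
    return np.array(out)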
As the title states, I want to create a function that'll take a multidimensional array A, and a number B, that ultimately returns the number in A that is the closest to B. If the number B is in A, then return it. If there's 2 numbers in A that are equally distant from B, choose the first one by counting from row to row.
This is the code I have so far:
import numpy as np
def g_C(A, B):
    A = np.asanyarray(A)
    assert A.ndim == 2  # to assert that A is a multidimensional array
    get = (np.abs(A - B)).argmin()
    return A[get]
However, from my understanding, I think (np.abs(A-B)).argmin() really only works effectively for sorted arrays? I'm not allowed to sort the array in this problem; I have to take it at face value, examining row by row and grabbing the first instance of the closest number to B.
So for example, g_C([[1,3,6,-8],[2,7,1,0],[4,5,2,8],[2,3,7,10]],9) should return 8
Also, I was given the hint that numpy.argmin would help, and I see that its purpose is to return the first occurrence of the minimum, which makes sense in this problem, but I'm not sure how exactly to fit that into the code I have at the moment.
EDIT
The flat suggestion works perfectly fine. Thank you everyone.
I'm trying RagingRoosevelt's second suggestion, and I'm stuck.
def g_C(A, B):
    A = np.asanyarray(A)
    D = np.full_like(A, B)        # created an array D with same qualities as array A, but just filled with values of B
    diffs = abs(D - A)            # finding absolute value differences between D and A
    close = diffs.argmin(axis=1)  # find argmin of 'diffs', row by row
    close = np.asanyarray(close)  # converted the argmins of 'diffs' into an array
    closer = close.argmin()       # the final argmin ??
    return closer
I'm trying out this suggestion because I have another problem related to this where I have to extract the row whose sum is the closest number to B. And I figure this is good practice anyway.
Your existing code is fine except, by default, argmin returns an index to the flattened array. So you could do
return A.flat[abs(A - B).argmin()]
to get the right value from A.
EDIT: For your other problem - finding the row in a 2-dimensional array A whose sum is closest to B - you can do:
return A[abs(A.sum(axis=1) - B).argmin()]
In either case I don't see any need to create an array of B.
Your problem is the same as a find-min problem. The only difference is that you're looking for min(abs(A[i]-B)) instead. So, iterate over your array. As you do so, record the smallest absolute delta and the index at which it occurred. When you find a smaller delta, update the record and then keep searching. When you've made it all the way through, return whatever value was at the recorded index.
Since you're working with numpy arrays, another approach is that you could create an array of identical size as A but filled only with value B. Compute the difference between the arrays and then use argmin on each row. Assemble an array of all minimum values for each row and then do argmin again to pull out the smallest of the values.
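Here is a sketch of that second approach (variable names are mine, and I use broadcasting instead of explicitly building the array of B, which behaves the same); note that the final argmin has to run on the row-wise minimum values, not on the per-row positions:

import numpy as np

def g_C_rows(A, B):
    A = np.asanyarray(A)
    diffs = np.abs(A - B)                          # same shape as A; broadcasting B stands in for the array of B
    cols = diffs.argmin(axis=1)                    # per-row position of the closest value
    row_mins = diffs[np.arange(A.shape[0]), cols]  # per-row minimum distances
    r = row_mins.argmin()                          # row holding the overall closest value (first one on ties)
    return A[r, cols[r]]

print(g_C_rows([[1, 3, 6, -8], [2, 7, 1, 0], [4, 5, 2, 8], [2, 3, 7, 10]], 9))  # prints 8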
This will work for any 2-dimensional array with a nested for-loop, but I am not sure that this is what you want (as in it doesn't use numpy).
def g_C(A, B):
    n = A[0][0]             # best value found so far
    m = abs(B - A[0][0])    # smallest distance found so far
    for r in A:
        for x in r:
            if abs(B - x) < m:
                m = abs(B - x)
                n = x
    return n
Nevertheless, it does work:
>>> g_C([[1,3,6,-8],[2,7,1,0],[4,5,2,8],[2,3,7,10]],9)
8
I'm working with a 2D numpy array A, performing a comparison of a one dimensional array, X, against each row in A. As approximate matches are found, I'm keeping track of their indices in A in a dtype=bool array S. I'd like to use S to shrink the field of match candidates in A to improve efficiency. Here's the basic idea in code:
def compare(nxt):
    S[nxt] = 0                     # sets boolean
    T = A[nxt, i:] == A[S, :-i]    # T has different dimensions than A
compare() is called repeatedly, and S is progressively populated with False values.
The problem is that the boolean array T is of the same dimensions as the pared down version of A not the original version. I'm hoping to use T to get the indices (in the unsliced A) of the approximate matches for later use.
np.argwhere(T)
This returns a list of indices of the matches, but again in the slice of A.
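One way I can think of to translate those sliced indices back (a sketch, assuming S is the boolean row mask used to slice A):

rows_kept = np.flatnonzero(S)        # original row numbers of the rows that survive the mask
sliced = np.argwhere(T)              # (row-in-slice, column) pairs
orig_rows = rows_kept[sliced[:, 0]]  # the same matches, expressed as rows of the unsliced A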
It seems like there has to be a better way to, at the same time, crop A for more efficient searching and still be able to get the correct index of the matching row.
Any thoughts?
I am new to python and am trying to learn more about how it works by successively optimizing chunks of naive code I have already written.
The following function involves a loop that performs operations on the elements of a list of lists of floats only when the values of the data structure satisfy some condition. I was wondering if anyone could comment on (1) ways to improve the performance of this loop and (2) general features of the type of loop I'm describing that make it more or less suitable for different approaches to improving it. Below I've included a minimal version of the loop I'm working with.
Some notes on the variables used below:
#p is a small integer (say, p=10)
#index1 is an integer between 0 and p
#k is an integer between 0 and, say, kmax=100
#mat1 is a list of list of floats whose size is [kmax,p],
# with all values initialized to 0.0.
# mat1 is changed by the loop below
#mat2 is a list of list of floats whose size is [kmax,p]
# with all values initialized to -2e10.
# mat2 is changed by other parts of the program
Also, if it matters, in my code this is all part of a class, so there are "self." statements for the variables. I have read that local variables are handled better by python functions; how does this translate to class constructs?
def greatFunction(index1, k):
    index2 = index1
    for j in range(p):
        if mat2[k][index2] > -1e10:
            mat1[k][j] = mat1[k][j] + mat2[k][index1]*mat2[k][index2]
        index2 = index2 - 1
        if index2 < 0:
            index2 = index2 + p
From what I have read, I thought this would be a prime candidate for replacing the lists of lists with NumPy arrays (in the class itself, not converting things inside the function) and using masks to take care of the boolean conditions. However, the NumPy version I wrote turned out to be slower than the vanilla Python implementation above. Any help speeding up the code, and more importantly helping me understand why and how such loops can be replaced with a better construction, would be much appreciated. Thank you!
It looks like index2 goes from index1 down to 0 in steps of 1 and then wraps around to p-1 and keeps decreasing. This is basically a modulo operation and thus can be simulated with np.mod, giving us index2 as column indices at each iteration. Then we index into the k-th row of mat2 with those column indices to get all the elements needed for our purpose. These elements are compared against the threshold (-1e10 in our case), giving us a mask, which is used to select elements from that row and set the corresponding ones in the output array mat1 after scaling with mat2[k][index1].
Since you are working with NumPy arrays, I am assuming you already have mat1 and mat2 converted to NumPy arrays with np.array().
Also, as mentioned in the comments, if mat2 has all its values initialized to -2e10, then mat2[k][index2] > -1e10 would never be true, so in that particular case, mat1 would keep its zeros. Thus, to generically explain how to vectorize such a case, I am assuming mat2 to have random numbers instead. The implementation would look something like this -
# Get column indices corresponding to index2
col_idx = np.mod(np.arange(index1,index1-p,-1),p)
# Get all mat2 values with those column indices at kth row
mat2_val = mat2[k,col_idx]
# Mask of valid elements
mask = mat2_val > -1e10
# Set valid ones in mat1 with valid ones from mat2
# after scaling with mat2[k][index1]
mat1[k][mask] = mat2[k][index1]*mat2_val[mask]
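For completeness, a quick self-check comparing the original loop against the vectorized version (sizes, seed and the random range are arbitrary choices for this sketch):

import numpy as np

p, kmax = 10, 100
rng = np.random.default_rng(0)
mat2 = rng.uniform(-3e10, 1.0, size=(kmax, p))   # random values so the threshold condition sometimes holds
index1, k = 3, 5

# original loop
mat1_loop = np.zeros((kmax, p))
index2 = index1
for j in range(p):
    if mat2[k][index2] > -1e10:
        mat1_loop[k][j] += mat2[k][index1] * mat2[k][index2]
    index2 -= 1
    if index2 < 0:
        index2 += p

# vectorized version
mat1_vec = np.zeros((kmax, p))
col_idx = np.mod(np.arange(index1, index1 - p, -1), p)
mat2_val = mat2[k, col_idx]
mask = mat2_val > -1e10
mat1_vec[k][mask] = mat2[k][index1] * mat2_val[mask]

print(np.allclose(mat1_loop, mat1_vec))   # True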
I am trying to find whether a list has 4 elements that sum to 0 (and later find what those elements are). I'm trying to make a solution based off the even k algorithm described at https://cs.stackexchange.com/questions/2973/generalised-3sum-k-sum-problem.
I got to this code in Python, using combinations from itertools in the standard library:
from itertools import combinations

def foursum(arr):
    seen = {sum(subset) for subset in combinations(arr, 2)}
    return any(-x in seen for x in seen)
But this fails for input like [-1, 1, 2, 3]. It fails because it matches the sum (-1+1) with itself. I think this problem will get even worse when I want to find the elements because you can separate a set of 4 distinct items into 2 sets of 2 items in 6 ways: {1,4}+{-2,-3}, {1,-2}+{4,-3} etc etc.
How can I make an algorithm that correctly returns all solutions avoiding this problem?
EDIT: I should have added that I want to use as efficient an algorithm as possible. O(len(arr)^4) is too slow for my task...
This works.
def foursum(arr):
    # map each pair sum to the set of index pairs that produce it
    seen = {}
    for i in range(len(arr)):
        for j in range(i + 1, len(arr)):
            if arr[i] + arr[j] in seen:
                seen[arr[i] + arr[j]].add((i, j))
            else:
                seen[arr[i] + arr[j]] = {(i, j)}
    # a valid quadruple needs two index-disjoint pairs with opposite sums
    for key in seen:
        if -key in seen:
            for (i, j) in seen[key]:
                for (p, q) in seen[-key]:
                    if i != p and i != q and j != p and j != q:
                        return True
    return False
EDIT
This can be made more Pythonic, I think; I don't know enough Python.
It is normal for the 4SUM problem to permit input elements to be used multiple times. For instance, given the input (2 3 1 0 -4 -1), valid solutions are (3 1 0 -4) and (0 0 0 0).
The basic algorithm is O(n^2): Use two nested loops, each running over all the items in the input, to form all sums of pairs, storing the sums and their components in some kind of dictionary (hash table, AVL tree). Then scan the pair-sums, reporting any quadruple for which the negative of the pair-sum is also present in the dictionary.
If you insist on not duplicating input elements, you can modify the algorithm slightly. When computing the two nested loops, start the second loop beyond the current index of the first loop, so no input elements are taken twice. Then, when scanning the dictionary, reject any quadruples that include duplicates.
I discuss this problem at my blog, where you will see solutions in multiple languages, including Python.
First note that the problem is O(n^4) in the worst case, since the output size itself might be Theta(n^4) (you are looking for all solutions, not only answering the decision problem).
Proof:
Take an array consisting of n/2 copies of -1 followed by n/2 copies of 1. You need to "choose" two instances of -1 without repeats, which gives (n/2)*(n/2-1)/2 possibilities, and two instances of 1 without repeats, another (n/2)*(n/2-1)/2 possibilities. This totals (n/2)*(n/2-1)*(n/2)*(n/2-1)/4, which is in Theta(n^4).
Now that we understand we cannot achieve O(n^2 logn) in the worst case, we can get to the following algorithm (pseudo-code), which should scale closer to O(n^2 logn) for "good" cases (few identical sums) and degrade to O(n^4) in the worst case (as expected).
Pseudo-code:
subsets <- all subsets of size 2 of indices (not values!)
l <- empty list
for each s in subsets:
    # appending a triplet of (sum, idx1, idx2):
    l.append((arr[s[0]] + arr[s[1]], s[0], s[1]))
sort l by the first element (the sum) in each tuple
for each x in l:
    binary search l for -x[0]  # for the sum
    for each element y that satisfies the above:
        if x[1] != y[1] and x[2] != y[1] and x[1] != y[2] and x[2] != y[2]:
            yield arr[x[1]], arr[x[2]], arr[y[1]], arr[y[2]]
Probably a Pythonic way to do the above would be more elegant and readable, but I am not a Python expert, I am afraid.
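As a rough Python translation of the pseudo-code above (using bisect for the binary search, plus a small set so the same quadruple is not reported once per possible pair split; the function name is my choice):

from bisect import bisect_left
from itertools import combinations

def foursum_all(arr):
    # (sum, idx1, idx2) for every index pair, sorted by sum
    l = sorted((arr[i] + arr[j], i, j) for i, j in combinations(range(len(arr)), 2))
    sums = [s for s, _, _ in l]
    seen = set()
    for s, i, j in l:
        k = bisect_left(sums, -s)            # binary search for the complementary sum
        while k < len(l) and sums[k] == -s:
            _, p, q = l[k]
            key = tuple(sorted((i, j, p, q)))
            if len({i, j, p, q}) == 4 and key not in seen:   # four distinct indices, not reported before
                seen.add(key)
                yield arr[i], arr[j], arr[p], arr[q]
            k += 1

print(list(foursum_all([-1, 1, 2, 3])))          # [] - no false positive from pairing (-1 + 1) with itself
print(list(foursum_all([2, 3, 1, 0, -4, -1])))   # [(-4, -1, 2, 3), (0, -4, 3, 1)]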
EDIT: Of course the algorithm must be at least as time-complex as the size of its output!
If the number of possible solutions is not 'large' compared to n, then here is a suggested solution in O(N^3):
Find pair-wise sums of all elements and build a NxN matrix of the sums.
For each element in this matrix, build a struct that has sumValue, row and column as its fields.
Sort all these N^2 struct elements in a 1D array. (in O(N^2 logN) time).
For each element x in this array, conduct a binary search for its partner y such that x + y = 0 (O(logn) per search).
Now if you find a partner y, check whether its row or column field matches that of the element x. If so, iterate sequentially in both directions (equal sums are adjacent after the sort) until there is no more such y.
If you find some y's that do not have a common row or column with x, then increment the count (or print the solution).
This iteration can at most take 2N steps because the length of rows and columns is N.
Hence the total complexity of this algorithm is O(N^2 * N) = O(N^3).