Get Indices To Split NumPy Array - python

Let's say I have a NumPy array:
x = np.array([3, 9, 2, 1, 5, 4, 7, 7, 8, 6])
If I sum up this array, I get 52. What I need is a way to split it up starting from left to right into roughly n chunks where n is chosen by the user. Essentially, the splits occur in a greedy fashion. So, for some number of chunks n, the first n - 1 chunks must each sum up to at least 52/n and they must be consecutive indices taken from left to right.
So, if n = 2 then the first chunk would consist of the first 7 elements:
chunk[0] = x[:7] # [3, 9, 2, 1, 5, 4, 7], sum = 31
chunk[1] = x[7:] # [7, 8, 6], sum = 21
Notice that the first chunk wouldn't consist of the first 6 elements only since the sum would be 24 which is less than 52/2 = 26. Also, notice that the number of elements in each chunk is allowed to vary as long as the sum criteria is met. Finally, it is perfectly fine for the last chunk to not be close to 52/2 = 26 since the other chunk(s) may take more.
However, the output that I need is a two column array that contains the start index in the first column and the (exclusive) stop index in the second column:
[[0, 7],
[7, 10]]
If n = 4, then the first 3 chunks need to each sum up to at least 52/4 = 13 and would look like this:
chunk[0] = x[:3] # [3, 9, 2], sum = 14
chunk[1] = x[3:7] # [1, 5, 4], sum = 17
chunk[2] = x[7:9] # [7, 8], sum = 15
chunk[3] = x[9:] # [6], sum = 6
And the output that I need would be:
[[0, 3],
[3, 7],
[7, 9],
[9, 10]
So, one naive approach using for loops might be:
ranges = np.zeros((n_chunks, 2), np.int64)
ranges_idx = 0
range_start_idx = start
sum = 0
for i in range(x.shape[0]):
sum += x[i]
if sum > x.sum() / n_chunks:
ranges[ranges_idx, 0] = range_start_idx
ranges[ranges_idx, 1] = min(
i + 1, x.shape[0]
) # Exclusive stop index
# Reset and Update
range_start_idx = i + 1
ranges_idx += 1
sum = 0
# Handle final range outside of for loop
ranges[ranges_idx, 0] = range_start_idx
ranges[ranges_idx, 1] = x.shape[0]
if ranges_idx < n_chunks - 1:
left[ranges_idx:] = x.shape[0]
return ranges
I am looking for a nicer vectorized solution.

I found inspiration in a similar question that was answered:
def func(x, n):
out = np.zeros((n, 2), np.int64)
cum_arr = x.cumsum() / x.sum()
idx = 1 + np.searchsorted(cum_arr, np.linspace(0, 1, n, endpoint=False)[1:])
out[1:, 0] = idx # Fill the first column with start indices
out[:-1, 1] = idx # Fill the second column with exclusive stop indices
out[-1, 1] = x.shape[0] # Handle the stop index for the final chunk
return out
Update
To cover the pathological case, we need to be a little more precise and do something like:
def func(x, n, truncate=False):
out = np.zeros((n_chunks, 2), np.int64)
cum_arr = x.cumsum() / x.sum()
idx = 1 + np.searchsorted(cum_arr, np.linspace(0, 1, n, endpoint=False)[1:])
out[1:, 0] = idx # Fill the first column with start indices
out[:-1, 1] = idx # Fill the second column with exclusive stop indices
out[-1, 1] = x.shape[0] # Handle the stop index for the final chunk
# Handle pathological case
diff_idx = np.diff(idx)
if np.any(diff_idx == 0):
row_truncation_idx = np.argmin(diff_idx) + 2
out[row_truncation_idx:, 0] = x.shape[0]
out[row_truncation_idx-1:, 1] = x.shape[0]
if truncate:
out = out[:row_truncation_idx]
return out

Here is a solution that doesn't iterate over all elements:
def fun2(array, n):
min_sum = np.sum(array) / n
cumsum = np.cumsum(array)
i = -1
count = min_sum
out = []
while i < len(array)-1:
j = np.searchsorted(cumsum, count)
out.append([i+1, j+1])
i = j
if i < len(array):
count = cumsum[i] + min_sum
out[-1][1] -= 1
return np.array(out)
For the two test cases it produces the results you expected. HTH

Related

Cut a sequence of length N into subsequences such that the sum of each subarray is less than M and the cut minimizes the sum of max of each part

Given an integer array sequence a_n of length N, cut the sequence into several parts such that every one of which is a consequtive subsequence of the original sequence.
Every part must satisfy the following:
The sum of each part is not greater than a given integer M
Find a cut that minimizes the sum of the maximum integer of each part
For example:
input : n = 8, m = 17 arr = [2, 2, 2, 8, 1, 8, 2, 1]
output = 12
explanation: subarrays = [2, 2, 2], [8, 1, 8], [2, 1]
sum = 2 + 8 + 2 = 12
0 <= N <= 100000
each integer is between 0 and 1000000
If no such cut exists, return -1
I believe this is a dynamic programming question, but I am not sure how to approach this.
I am relatively new to coding, and came across this question in an interview which I could not do. I would like to know how to solve it for future reference.
Heres what I tried:
n = 8
m = 17
arr = [2, 2, 2, 8, 1, 8, 2, 1]
biggest_sum, i = 0, 0
while (i < len(arr)):
seq_sum = 0
biggest_in_seq = -1
while (seq_sum <= m and i < len(arr)):
if (seq_sum + arr[i] <= m ):
seq_sum += arr[i]
if (arr[i] > biggest_in_seq):
biggest_in_seq = arr[i]
i += 1
else:
break
biggest_sum += biggest_in_seq
if (biggest_sum == 0):
print(-1)
else:
print(biggest_sum)
This givens the result 16, and the subsequences are: [[2, 2, 2, 8, 1], [8, 2, 1]]
Problem is that you are filling every sequence from left to right up to the maximum allowed value m. You should evaluate different options of sequence lengths and minimize the result, which in the example means that the 2 8 values must be in the same sequence.
a possible solution could be:
n = 8
m = 17
arr = [2, 2, 2, 8, 1, 8, 2, 1]
def find_solution(arr, m, n):
if max(arr)>m:
return -1
optimal_seq_length = [0] * n
optimal_max_sum = [0] * n
for seq_start in reversed(range(n)):
seq_len = 0
seq_sum = 0
seq_max = 0
while True:
seq_len += 1
seq_end = seq_start + seq_len
if seq_end > n:
break
last_value_in_seq = arr[seq_end - 1]
seq_sum += last_value_in_seq
if seq_sum > m:
break
seq_max = max(seq_max, last_value_in_seq)
max_sum_from_next_seq_on = 0 if seq_end >= n else optimal_max_sum[seq_end]
max_sum = max_sum_from_next_seq_on + seq_max
if seq_len == 1 or max_sum < optimal_max_sum[seq_start]:
optimal_max_sum[seq_start] = max_sum
optimal_seq_length[seq_start] = seq_len
# create solution list of lists
solution = []
seg_start = 0
while seg_start < n:
seg_length = optimal_seq_length[seg_start]
solution.append(arr[seg_start:seg_start+seg_length])
seg_start += seg_length
return solution
print(find_solution(arr, m, n))
# [[2, 2, 2], [8, 1, 8], [2, 1]]
Key aspects of my proposal:
start from a small array (only last element), and make the problem array grow to the front:
[1]
[2, 1]
[8, 2, 1]
etc.
for each of above problem arrays, store:
the optimal sum of the maximum of each sequence (optimal_max_sum), which is the value to be minimized
the sequence length of the first sequence (optimal_seq_length) to achieve this optimal value
do this by: for each allowed sequence length starting at the beginning of the problem array:
calculate the new max_sum value and add it to previously calculated optimal_max_sum for the part after this sequence
keep the smallest max_sum, store it in optimal_max_sum and the associated seq_length in optimal_seq_length

Length of the intersections between a list an list of list

Note : almost duplicate of Numpy vectorization: Find intersection between list and list of lists
Differences :
I am focused on efficiently when the lists are large
I'm searching for the largest intersections.
x = [500 numbers between 1 and N]
y = [[1, 2, 3], [4, 5, 6, 7], [8, 9], [10, 11, 12], etc. up to N]
Here are some assumptions:
y is a list of ~500,000 sublist of ~500 elements
each sublist in y is a range, so y is characterized by the last elements of each sublists. In the example : 3, 7, 9, 12 ...
x is not sorted
y contains once and only once each numbers between 1 and ~500000*500
y is sorted in the sense that, as in the example, the sub-lists are sorted and the first element of one sublist is the next of the last element of the previous list.
y is known long before even compile-time
My purpose is to know, among the sublists of y, which have at least 10 intersections with x.
I can obviously make a loop :
def find_best(x, y):
result = []
for index, sublist in enumerate(y):
intersection = set(x).intersection(set(sublist))
if len(intersection) > 2: # in real live: > 10
result.append(index)
return(result)
x = [1, 2, 3, 4, 5, 6]
y = [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]
res = find_best(x, y)
print(res) # [0, 2]
Here the result is [0,2] because the first and third sublist of y have 2 elements in intersection with x.
An other method should to parse only once y and count the intesections :
def find_intersec2(x, y):
n_sublists = len(y)
res = {num: 0 for num in range(0, n_sublists + 1)}
for list_no, sublist in enumerate(y):
for num in sublist:
if num in x:
x.remove(num)
res[list_no] += 1
return [n for n in range(n_sublists + 1) if res[n] >= 2]
This second method uses more the hypothesis.
Questions :
what optimizations are possibles ?
Is there a completely different approach ? Indexing, kdtree ? In my use case, the large list y is known days before the actual run. So i'm not afraid to buildind an index or whatever from y. The small list x is only known at runtime.
Since y contains disjoint ranges and the union of them is also a range, a very fast solution is to first perform a binary search on y and then count the resulting indices and only return the ones that appear at least 10 times. The complexity of this algorithm is O(Nx log Ny) with Nx and Ny the number of items in respectively x and y. This algorithm is nearly optimal (since x needs to be read entirely).
Actual implementation
First of all, you need to transform your current y to a Numpy array containing the beginning value of all ranges (in an increasing order) with N as the last value (assuming N is excluded for the ranges of y, or N+1 otherwise). This part can be assumed as free since y can be computed at compile time in your case. Here is an example:
import numpy as np
y = np.array([1, 4, 8, 10, 13, ..., N])
Then, you need to perform the binary search and check that the values fits in the range of y:
indices = np.searchsorted(y, x, 'right')
# The `0 < indices < len(y)` check should not be needed regarding the input.
# If so, you can use only `indices -= 1`.
indices = indices[(0 < indices) & (indices < len(y))] - 1
Then you need to count the indices and filter the ones with at least :
uniqueIndices, counts = np.unique(indices, return_counts=True)
result = uniqueIndices[counts >= 10]
Here is an example based on your:
x = np.array([1, 2, 3, 4, 5, 6])
# [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]
y = np.array([1, 4, 5, 7, 8, 12])
# Actual simplified version of the above algorithm
indices = np.searchsorted(y, x, 'right') - 1
uniqueIndices, counts = np.unique(indices, return_counts=True)
result = uniqueIndices[counts >= 2]
# [0, 2]
print(result.tolist())
It runs in less than 0.1 ms on my machine on a random input based on your input constraints.
Turn y into 2 dicts.
index = { # index to count map
0 : 0,
1 : 0,
2 : 0,
3 : 0,
4 : 0
}
y = { # elem to index map
1: 0,
2: 0,
3: 0,
4: 1,
5: 2,
6: 2,
7: 3,
8 : 4,
9 : 4,
10 : 4,
11 : 4
}
Since you know y in advance, I don't count the above operations into the time complexity. Then, to count the intersection:
x = [1, 2, 3, 4, 5, 6]
for e in x: index[y[e]] += 1
Since you mentioned x is small, I try to make the time complexity depends only on the size of x (in this case O(n)).
Finally, the answer is the list of keys in index dict where the value is >= 2 (or 10 in real case).
answer = [i for i in index if index[i] >= 2]
This uses y to create a linear array mapping every int to the (1 plus), the index of the range or subgroup the int is in; called x2range_counter.
x2range_counter uses a 32 bit array.array type to save memory and can be cached and used for calculations of all x on the same y.
calculating the hits in each range for a particular x is then just indirected array incrementing of a count'er in function count_ranges`.
y = [[1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11, 12]]
x = [5, 3, 1, 11, 8, 10]
range_counter_max = len(y)
extent = y[-1][-1] + 1 # min in y must be 1 not 0 remember.
x2range_counter = array.array('L', [0] * extent) # efficient 32 bit array storage
# Map any int in any x to appropriate ranges counter.
for range_counter_index, rng in enumerate(y, start=1):
for n in rng:
x2range_counter[n] = range_counter_index
print(x2range_counter) # array('L', [0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4])
# x2range_counter can be saved for this y and any x on this y.
def count_ranges(x: List[int]) -> List[int]:
"Number of x-hits on each y subgroup in order"
# Note: count[0] initially catches errors. count[1..] counts x's in y ranges [0..]
count = array.array('L', [0] * (range_counter_max + 1))
for xx in x:
count[x2range_counter[xx]] += 1
assert count[0] == 0, "x values must all exist in a y range and y must have all int in its range."
return count[1:]
print(count_ranges(x)) # array('L', [1, 2, 1, 2])
I created a class for this, with extra functionality such as returning the ranges rather than the indices; all ranges hit >=M times; (range, hit-count) tuples sorted most hit first.
Range calculations for different x are proportional to x and are simple array lookups rather than any hashing of dicts.
What do you think?

sliding window algorithm - condition for start < n

Below is a sliding window solution for finding the minimum length subarray with a sum greater than x from geeksforgeeks (https://www.geeksforgeeks.org/minimum-length-subarray-sum-greater-given-value/)
# O(n) solution for finding smallest
# subarray with sum greater than x
# Returns length of smallest subarray
# with sum greater than x. If there
# is no subarray with given sum, then
# returns n + 1
def smallestSubWithSum(arr, n, x):
# Initialize current sum and minimum length
curr_sum = 0
min_len = n + 1
# Initialize starting and ending indexes
start = 0
end = 0
while (end < n):
# Keep adding array elements while current
# sum is smaller than x
while (curr_sum <= x and end < n):
curr_sum += arr[end]
end+= 1
# If current sum becomes greater than x.
while (curr_sum > x and start < n):
# Update minimum length if needed
if (end - start < min_len):
min_len = end - start
# remove starting elements
curr_sum -= arr[start]
start+= 1
return min_len
I have tested that this solution can work, but I'm confused by why in the last while loop, start is checked for being less than n - wouldn't you want it to be less than end, otherwise start can go beyond end, which doesn't really make sense to me?
Since curr_sum was built by adding elements up to end, it will get to zero (or smaller than x) before start can reach end. This will exit the while loop. This also implies that the algorithm will probably not work with negative numbers in the array.
Personally I would have written it a little differently. Here's an example with the negative number condition taken into account:
def minSub(arr,x):
subTotal = 0
size,minSize = 0,len(arr)+1
start = iter(arr)
for value in arr:
subTotal += value
size += 1
while subTotal not in range(0,x+1):
if subTotal>x :
minSize = min(minSize,size)
subTotal -= next(start,0)
size -= 1
return minSize
output:
arr = [1, 4, 45, 6, 0, 19]
x = 51
print(minSub(arr,x)) #3
arr = [-8, 1, 4, -1, 3, -6]
x = 6
print(minSub(arr,x)) # 4
arr = [1, 11, 100, 1, 0, 200, 3, 2, 1, 250]
x = 280
print(minSub(arr,x)) # 4
arr = [1, 10, 5, 2, 7]
x = 9
print(minSub(arr,x)) # 1
arr = [1, 2, 4]
x = 8
print(minSub(arr,x)) # 4

constructing arithmetic progressions from loop

I am trying to work out a program that would calculate the diagonal coefficients of pascal's triangle.
For those who are not familiar with it, the general terms of sequences are written below.
1st row = 1 1 1 1 1....
2nd row = N0(natural number) // 1 = 1 2 3 4 5 ....
3rd row = N0(N0+1) // 2 = 1 3 6 10 15 ...
4th row = N0(N0+1)(N0+2) // 6 = 1 4 10 20 35 ...
the subsequent sequences for each row follows a specific pattern and it is my goal to output those sequences in a for loop with number of units as input.
def figurate_numbers(units):
row_1 = str(1) * units
row_1_list = list(row_1)
for i in range(1, units):
sequences are
row_2 = n // i
row_3 = (n(n+1)) // (i(i+1))
row_4 = (n(n+1)(n+2)) // (i(i+1)(i+2))
>>> def figurate_numbers(4): # coefficients for 4 rows and 4 columns
[1, 1, 1, 1]
[1, 2, 3, 4]
[1, 3, 6, 10]
[1, 4, 10, 20] # desired output
How can I iterate for both n and i in one loop such that each sequence of corresponding row would output coefficients?
You can use map or a list comprehension to hide a loop.
def f(x, i):
return lambda x: ...
row = [ [1] * k ]
for i in range(k):
row[i + 1] = map( f(i), row[i])
where f is function that descpribe the dependency on previous element of row.
Other possibility adapt a recursive Fibbonachi to rows. Numpy library allows for array arifmetics so even do not need map. Also python has predefined libraries for number of combinations etc, perhaps can be used.
To compute efficiently, without nested loops, use Rational Number based solution from
https://medium.com/#duhroach/fast-fun-with-pascals-triangle-6030e15dced0 .
from fractions import Fraction
def pascalIndexInRowFast(row,index):
lastVal=1
halfRow = (row>>1)
#early out, is index < half? if so, compute to that instead
if index > halfRow:
index = halfRow - (halfRow - index)
for i in range(0, index):
lastVal = lastVal * (row - i) / (i + 1)
return lastVal
def pascDiagFast(row,length):
#compute the fractions of this diag
fracs=[1]*(length)
for i in range(length-1):
num = i+1
denom = row+1+i
fracs[i] = Fraction(num,denom)
#now let's compute the values
vals=[0]*length
#first figure out the leftmost tail of this diag
lowRow = row + (length-1)
lowRowCol = row
tail = pascalIndexInRowFast(lowRow,lowRowCol)
vals[-1] = tail
#walk backwards!
for i in reversed(range(length-1)):
vals[i] = int(fracs[i]*vals[i+1])
return vals
Don't reinvent the triangle:
>>> from scipy.linalg import pascal
>>> pascal(4)
array([[ 1, 1, 1, 1],
[ 1, 2, 3, 4],
[ 1, 3, 6, 10],
[ 1, 4, 10, 20]], dtype=uint64)
>>> pascal(4).tolist()
[[1, 1, 1, 1], [1, 2, 3, 4], [1, 3, 6, 10], [1, 4, 10, 20]]

Sum of specific elements in a list, if they are consecutive (python)

What question is asking for is, from a list of lists like the following, to return a tuple that contains tuples of all occurrences of the number 2 for a given index on the list. If there are X consecutive 2s, then it should appear only one element in the inside tuple containing X, just like this:
[[1, 2, 2, 1],
[2, 1, 1, 2],
[1, 1, 2, 2]]
Gives
((1,), (1,), (1, 1),(2,))
While
[[2, 2, 2, 2],
[2, 1, 2, 2],
[2, 2, 1, 2]]
Gives
((3,),(1, 1),(2,)(3,))
What about the same thing but not for the columns, this time, for the rows? is there a "one-line" method to do it? I mean:
[[1, 2, 2, 1],
[2, 1, 1, 2],
[1, 1, 2, 2]]
Gives
((2,), (1, 1), (2,))
While
[[2, 2, 2, 2],
[2, 1, 2, 2],
[2, 2, 1, 2]]
Gives
((4,),(1, 2),(2, 1))
I have tried some things, this is one of the things, I can't finish it, don't know what to do anymore, after it:
l = [[2,2,2],[2,2,2],[2,2,2]]
t = (((1,1),(2,),(2,)),((2,),(2,),(1,1)))
if [x.count(0) for x in l] == [0 for x in l]:
espf = []*len(l)
espf2 = []
espf_atual = 0
contador = 0
for x in l:
for c in x:
celula = x[c]
if celula == 2:
espf_atual += 1
else:
if celula == 1:
espf[contador] = [espf_atual]
contador += 1
espf_atual = 0
espf2 += [espf_atual]
espf_atual = 0
print(tuple(espf2))
output
(3, 3, 3)
this output is the correct one but if I change the list(l) it doesn't work
So, you have som emistakes in the code.
Indexing:
for c in x:
celula = x[c]
It should be celula = c as c already points to each element of x.
Intermediate results
For each column you store intermediate results as:
espf_atual = 0
...
espf_atual += 1
...
espf2 += [espf_atual]
but this will only allow to store the last occurrences of 2 for each column. This is, if a row is [2,1,2,2], then espf_actual = 2 and you will store only the last occurrence. You will override the first occurrence (before the 1).
To avoid this, you need to store intermediate results for each row. You got it halfway with espf = []*len(l), but you never used it properly later.
Find bellow a working example (not too different from your initial solution):
espf = []
for x in l:
# Restart counters for every row
espf_current = [] # Will store any sequences of 2
contador = 0 # Will count consecutive 2's
for c in x:
celula = c
if celula == 2:
contador += 1 # Count number of 2
elif celula == 1:
if contador > 0: # Store any 2 before 1
espf_current += [contador]
contador = 0
if contador > 0: # Check if the row ends in 2
espf_current += [contador]
# Store results of this row in the final results
espf += [tuple(espf_current)]
print tuple(espf)
The key to switch rows and columns, is to change the indexing method. Currently you are iterating along the elements of the list, and thus, this doesn't allow you to switch between rows and columns.
Another way to see the iteration is to iterate indexes of the matrix (i, j for rows and columns) as follows:
numRows = len(l)
numCols = len(l[0])
for i in range(numRows):
for j in range(numCols):
celula = l[i][j]
The above is the indexing equivalent to the previous code. It assumes all the rows have the same length (which is true in your examples). Changing it from rows to columns is straightforward (tip: switch the loops), I leave it to you :P

Categories