Replace consecutive duplicates in 2D numpy array - python

I have a two-dimensional numpy array x:
import numpy as np

x = np.array([
    [1, 2, 8, 4, 5, 5, 5, 3],
    [0, 2, 2, 2, 2, 1, 1, 4]
])
My goal is to replace all consecutive duplicate numbers with a specific value (let's say -1), while leaving one occurrence unchanged.
I could do this as follows:
def replace_consecutive_duplicates(x):
    consec_dup = np.zeros(x.shape, dtype=bool)
    consec_dup[:, 1:] = np.diff(x, axis=1) == 0
    x[consec_dup] = -1
    return x
# current output
replace_consecutive_duplicates(x)
# array([[ 1,  2,  8,  4,  5, -1, -1,  3],
#        [ 0,  2, -1, -1, -1,  1, -1,  4]])
However, in this case the one occurrence left unchanged is always the first.
My goal is to leave the middle occurrence unchanged.
So given the same x as input, the desired output of function replace_consecutive_duplicates is:
# desired output
replace_consecutive_duplicates(x)
# array([[ 1,  2,  8,  4, -1,  5, -1,  3],
#        [ 0, -1,  2, -1, -1,  1, -1,  4]])
Note that for consecutive duplicate sequences with an even number of occurrences, the left of the two middle values should remain unchanged. So the consecutive duplicate sequence [2, 2, 2, 2] in x[1] becomes [-1, 2, -1, -1].
Also note that I'm looking for a vectorized solution for 2D numpy arrays since performance is of absolute importance in my particular use case.
I've already tried looking at things like run length encoding and using np.diff(), but I didn't manage to solve this. Hope you guys can help!

The main problem is that you need the length of each run of consecutive values. That is not trivial to get with numpy alone, but itertools.groupby solves it nicely, as in the following code.
import itertools

import numpy as np

x = np.array([
    [1, 2, 8, 4, 5, 5, 5, 3],
    [0, 2, 2, 2, 2, 1, 1, 4]
])
def replace_row(arr: np.ndarray, new_val=-1):
    results = []
    for val, group in itertools.groupby(arr):
        k = len(list(group))                        # length of this run
        results.extend([new_val] * ((k - 1) // 2))  # left half of the run
        results.append(val)                         # keep the middle occurrence
        results.extend([new_val] * (k // 2))        # right half of the run
    return np.fromiter(results, arr.dtype)
if __name__ == '__main__':
    for idx, row in enumerate(x):
        x[idx, :] = replace_row(row)
    print(x)
Output:
[[ 1  2  8  4 -1  5 -1  3]
 [ 0 -1  2 -1 -1  1 -1  4]]
This isn't vectorized, but since every row is handled independently, the rows can be processed in parallel (e.g. with multiprocessing).
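For completeness, since the question explicitly asks for a vectorized solution: run lengths can be recovered with numpy alone by marking run starts, flattening, and diffing the start positions. Below is a sketch of that idea (unbenchmarked, so treat the performance as an assumption to verify):

import numpy as np

def replace_consecutive_duplicates_vec(x, new_val=-1):
    # Mark where each run starts; column 0 always starts a run,
    # so runs never cross row boundaries after flattening.
    starts = np.ones(x.shape, dtype=bool)
    starts[:, 1:] = x[:, 1:] != x[:, :-1]
    flat_starts = np.flatnonzero(starts)
    # Run lengths are the gaps between consecutive start positions.
    run_lens = np.diff(np.append(flat_starts, x.size))
    # Index of the middle element of each run (middle-left for even lengths).
    keep = flat_starts + (run_lens - 1) // 2
    out = np.full(x.size, new_val, dtype=x.dtype)
    out[keep] = x.ravel()[keep]
    return out.reshape(x.shape)

x = np.array([
    [1, 2, 8, 4, 5, 5, 5, 3],
    [0, 2, 2, 2, 2, 1, 1, 4]
])
print(replace_consecutive_duplicates_vec(x))
# [[ 1  2  8  4 -1  5 -1  3]
#  [ 0 -1  2 -1 -1  1 -1  4]]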

Related

python replace value in array based on previous and following value in column

Given the following array, I want to replace each zero with the previous value in its column, as long as it is surrounded by two values greater than zero.
I am aware of np.where, but it would consider the whole array instead of each column separately.
I am not sure how to do it, and help would be appreciated.
This is the array:
a = np.array([[4, 3, 3, 2],
              [0, 0, 1, 2],
              [0, 4, 2, 4],
              [2, 4, 3, 0]])
Since the only zero that meets this condition is the one in the second row, second column, the new array should be the following:
new_a = np.array([[4, 3, 3, 2],
                  [0, 3, 1, 2],
                  [0, 4, 2, 4],
                  [2, 4, 3, 0]])
How do I accomplish this?
And what if I would like to allow a longer gap surrounded by nonzero values? For instance, the first column contains two 0s and the second column contains one 0, so the new array would be:
new_a = np.array([[4, 3, 3, 2],
                  [4, 3, 1, 2],
                  [4, 4, 2, 4],
                  [2, 4, 3, 0]])
In short, how do I solve this if the column-wise condition is having N consecutive zeros or fewer?
As a generic method, I would approach this using a convolution:
import numpy as np
from scipy.signal import convolve2d

# kernel selecting the top/bottom neighbors
kernel = np.array([[1],
                   [0],
                   [1]])

# is the value a zero?
m1 = a == 0
# does the cell have non-zero neighbors both above and below?
m2 = convolve2d(~m1, kernel, mode='same') > 1
mask = m1 & m2
# replace matching values with the value from the previous row
a[mask] = np.roll(a, 1, axis=0)[mask]
output:
array([[4, 3, 3, 2],
       [0, 3, 1, 2],
       [0, 4, 2, 4],
       [2, 4, 3, 0]])
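For the example a, only the zero at row 1, column 1 has non-zero neighbors both above and below, so mask selects exactly that cell:
>>> mask
array([[False, False, False, False],
       [False,  True, False, False],
       [False, False, False, False],
       [False, False, False, False]])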
filling from surrounding values
Using pandas to benefit from ffill/bfill (you can forward-fill in pure numpy, but it's more complex; a sketch follows after the output below):
import pandas as pd

df = pd.DataFrame(a)
# limit for the fill distance
N = 2
# identify non-zeros
m = df.ne(0)
# keep True where non-zero, NaN elsewhere
m2 = m.where(m)
# mask for zeros with a non-zero value within N rows above and below
mask = m2.ffill(limit=N) & m2.bfill(limit=N)
df.mask(mask & ~m).ffill()
array([[4, 3, 3, 2],
       [4, 3, 1, 2],
       [4, 4, 2, 4],
       [2, 4, 3, 0]])
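For reference, the pure-numpy forward fill mentioned above can be sketched as follows (note this version has no gap-length limit, unlike the pandas variant with limit=N; ffill_zeros_columnwise is an illustrative name):

import numpy as np

def ffill_zeros_columnwise(a):
    # For each cell, record the row index of the last non-zero value
    # at or above it, then gather those values column by column.
    mask = a != 0
    idx = np.where(mask, np.arange(a.shape[0])[:, None], 0)
    np.maximum.accumulate(idx, axis=0, out=idx)
    return a[idx, np.arange(a.shape[1])]

Zeros at the top of a column stay unchanged, since there is nothing above to fill them from.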
Here's one solution I found. I know it's basic, but I think it works.
a = np.array([[4, 3, 3, 2],
              [0, 0, 1, 2],
              [0, 4, 2, 4],
              [2, 4, 3, 0]])

a_t = a.T
for i in range(len(a_t)):
    ar = a_t[i]
    for j in range(len(ar) - 1):
        # a zero followed by a positive value gets the previous value
        if (j > 0) and (ar[j] == 0) and (ar[j + 1] > 0):
            a_t[i][j] = a_t[i][j - 1]
a = a_t.T

numpy argmin vectorization

I'm trying to iterate over the rows of a numpy array and, for each cluster of 3 elements, put the index of the lowest value into another array. This is in the context of left, middle, right: the edges only look at two values ('left and middle' or 'middle and right'), but every position in the middle should look at all 3.
For loops do this trivially, but it's very slow. Some kind of numpy vectorization would probably speed this up.
For example:
[1 18 3 6 2]
# should give the indices...
[0 0 2 4 4] # matching values 1 1 3 2 2
Slow for-loop implementation:
for y in range(height):
    for x in range(width):
        i = 0 if x == 0 else x - 1
        other_array[y, x] = np.argmin(array[y, i:x+2]) + i
NOTE: See update below for a solution with no for loops.
This works for an array of any number of dimensions:
def window_argmin(arr):
    padded = np.pad(
        arr,
        [(0,)] * (arr.ndim - 1) + [(1,)],
        'constant',
        constant_values=np.max(arr) + 1,
    )
    slices = np.concatenate(
        [
            padded[..., np.newaxis, i:i+3]
            for i in range(arr.shape[-1])
        ],
        axis=-2,
    )
    return (
        np.argmin(slices, axis=-1) +
        np.arange(-1, arr.shape[-1] - 1)
    )
The code uses np.pad to pad the last dimension of the array with one extra element on the left and one on the right, so we can always use windows of 3 elements for the argmin. It sets the padded elements to max + 1 so they'll never be picked by argmin.
Then it uses np.concatenate on a list of slices to add a new dimension holding the 3-element windows. This is the only place we use a for loop, and it only loops over the last dimension once to create the separate windows. (See the update below for a solution that removes this for loop.)
Finally, we call np.argmin on each of the windows.
The argmin indices are relative to each window, so we need to adjust them by adding the offset of the window's first element (which is -1 for the first window, since that element is padding). A simple addition of an arange array does this adjustment via broadcasting.
Here's a test with your sample array:
>>> x = np.array([1, 18, 3, 6, 2])
>>> window_argmin(x)
array([0, 0, 2, 4, 4])
And a 3d example:
>>> z
array([[[ 1, 18,  3,  6,  2],
        [ 1,  2,  3,  4,  5],
        [ 3,  6, 19, 19,  7]],

       [[ 1, 18,  3,  6,  2],
        [99,  4,  4, 67,  2],
        [ 9,  8,  7,  6,  3]]])
>>> window_argmin(z)
array([[[0, 0, 2, 4, 4],
        [0, 0, 1, 2, 3],
        [0, 0, 1, 4, 4]],

       [[0, 0, 2, 4, 4],
        [1, 1, 1, 4, 4],
        [1, 2, 3, 4, 4]]])
UPDATE: Here's a version using stride_tricks that doesn't use any for loops:
def window_argmin(arr):
    padded = np.pad(
        arr,
        [(0,)] * (arr.ndim - 1) + [(1,)],
        'constant',
        constant_values=np.max(arr) + 1,
    )
    slices = np.lib.stride_tricks.as_strided(
        padded,
        shape=arr.shape + (3,),
        strides=padded.strides + (padded.strides[-1],),
    )
    return (
        np.argmin(slices, axis=-1) +
        np.arange(-1, arr.shape[-1] - 1)
    )
What helped me come up with the stride tricks solution was this numpy issue asking to add a sliding window function, linking to an example implementation of it, so I just adapted it for this specific case. It's still pretty much magic to me, but it works. 😁
Tested and works as expected for arrays of different numbers of dimensions.
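As an aside, that sliding window function has since been added to NumPy (1.20+) as np.lib.stride_tricks.sliding_window_view, which avoids hand-rolling the strides. A sketch of the same approach using it:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def window_argmin_swv(arr):
    # Pad the last axis with max+1 sentinels, exactly as before.
    padded = np.pad(
        arr,
        [(0, 0)] * (arr.ndim - 1) + [(1, 1)],
        constant_values=np.max(arr) + 1,
    )
    # Shape (..., n, 3): one 3-element window per original position.
    windows = sliding_window_view(padded, 3, axis=-1)
    return np.argmin(windows, axis=-1) + np.arange(-1, arr.shape[-1] - 1)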
import numpy as np

array = [1, 18, 3, 6, 2]
array.insert(0, np.max(array) + 1)  # pad the front (right-shifts the array)
# [19, 1, 18, 3, 6, 2]
other_array = [np.argmin(array[i-1:i+2]) + i - 2 for i in range(1, len(array))]
# [0, 0, 2, 4, 4]
array.remove(np.max(array))  # restore the original array
# [1, 18, 3, 6, 2]

Changing a number within a matrix

I randomly generate a matrix. Let's assume for simplicity that it has the following form, with np.shape(A) == (2, 4):
import numpy as np
A:
matrix([[ 1,  2,  3,  4],
        [ 3,  4, 10,  8]])
Then, I estimate the following expression:
import numpy as np
K = 3
I = 4
C0 = np.sum(np.maximum(A[-1] - K, 0)) / I
The question is: how do I impose the following restriction? If any number in a column of the matrix A is less than or equal to (<=) K (3), then change the last number of that column to zero. So basically, my matrix should transform to this:
A:
matrix([[ 1,  2,  3,  4],
        [ 0,  0,  0,  8]])
This is one way.
A[-1][np.any(A <= 3, axis=0)] = 0
# matrix([[1, 2, 3, 4],
#         [0, 0, 0, 8]])

A[-1][np.any((A > 2) & (A <= 3), axis=0)] = 0
# matrix([[1, 2, 3, 4],
#         [0, 4, 0, 8]])

Finding differences between all values in a list

I want to find the differences between all values in a numpy array and append them to a new list.
Example: a = [1, 4, 2, 6]
result: newlist = [3, 1, 5, 3, 2, 2, 1, 2, 4, 5, 2, 4]
i.e. for each value of a, determine its difference from every other value in the list.
So far I have been unable to find a solution.
You can do this:
a = [1, 4, 2, 6]
# compare positions rather than values, so duplicate values are not skipped
newlist = [abs(a[i] - a[j]) for i in range(len(a)) for j in range(len(a)) if i != j]
Output:
print(newlist)
[3, 1, 5, 3, 2, 2, 1, 2, 4, 5, 2, 4]
I believe what you are trying to do is to calculate the absolute differences between elements of the input list, excluding the self-differences. With that idea, here is one vectorized approach:
import numpy as np

# Input list
a = [1, 4, 2, 6]

# Convert the input list to a numpy array
arr = np.array(a)

# Calculate absolute differences of each element
# against all elements, giving a 2D array
sub_arr = np.abs(arr[:, None] - arr)

# Diagonal indices (the self-differences) in the flattened array
N = arr.size
rem_idx = np.arange(N) * (N + 1)

# Remove the diagonal elements for the final output
out = np.delete(sub_arr, rem_idx)
Sample run to show the outputs at each step -
In [60]: a
Out[60]: [1, 4, 2, 6]
In [61]: arr
Out[61]: array([1, 4, 2, 6])
In [62]: sub_arr
Out[62]:
array([[0, 3, 1, 5],
       [3, 0, 2, 2],
       [1, 2, 0, 4],
       [5, 2, 4, 0]])
In [63]: rem_idx
Out[63]: array([ 0, 5, 10, 15])
In [64]: out
Out[64]: array([3, 1, 5, 3, 2, 2, 1, 2, 4, 5, 2, 4])
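An equivalent way to drop the diagonal, using a boolean mask instead of flattened indices:

out = sub_arr[~np.eye(N, dtype=bool)]
# array([3, 1, 5, 3, 2, 2, 1, 2, 4, 5, 2, 4])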

Compare multiple columns in numpy array

I have a 2D numpy array with about 12 columns and 1000+ rows, and each cell contains a number from 1 to 5. I'm searching for the best sextuple of columns according to my point system, where 1 and 2 score -1 point each and 4 and 5 score +1 each.
If a row in a certain sextuple contains, for example, [1, 4, 5, 3, 4, 3] the point for this row should be +2, because 3*1 + 1*(-1) = 2. Next row may be [1, 2, 2, 3, 3, 3] and should be -3 points.
At first, I tried a straightforward loop solution, but I realized there are 665,280 ordered selections (924 distinct combinations) of 6 columns to compare, and when I also need to search for the best quintuple, quadruple, etc., the loop takes forever.
Is there perhaps a smarter numpy-way of solving my problem?
import itertools

import numpy as np

N_rows = 10
# np.random.random_integers is deprecated; randint(1, 6) draws from 1..5
arr = np.random.randint(1, 6, size=(N_rows, 12))

x = np.array([0, -1, -1, 0, 1, 1])
y = x[arr]
print(y)

score, best_sextuple = max((y[:, cols].sum(), cols)
                           for cols in itertools.combinations(range(12), 6))

print('''\
score: {s}
sextuple: {c}
'''.format(s=score, c=best_sextuple))
yields, for example,
score: 6
sextuple: (0, 1, 5, 8, 10, 11)
Explanation:
First, let's generate a random example, with 12 columns and 10 rows:
N_rows = 10
arr = np.random.randint(1, 6, size=(N_rows, 12))
Now we can use numpy indexing to convert the numbers in arr 1,2,...,5 to the values -1,0,1 (according to your scoring system):
x = np.array([0,-1,-1,0,1,1])
y = x[arr]
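For instance, applied to the example row [1, 4, 5, 3, 4, 3] from the question:

>>> x[np.array([1, 4, 5, 3, 4, 3])]
array([-1,  1,  1,  0,  1,  0])

which sums to +2, as expected.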
Next, let's use itertools.combinations to generate all possible combinations of 6 columns:
for cols in itertools.combinations(range(12),6)
and
y[:,cols].sum()
then gives the score for cols, a choice of columns (sextuple).
Finally, use max to pick off the sextuple with the best score:
score, best_sextuple = max((y[:,cols].sum(), cols)
for cols in itertools.combinations(range(12),6))
import numpy

A = numpy.random.randint(1, 6, size=(1000, 12))
points = -1*(A == 1) + -1*(A == 2) + 1*(A == 4) + 1*(A == 5)
columnsums = numpy.sum(points, 0)

def best6(row):
    return numpy.argsort(row)[-6:]

bestcolumns = best6(columnsums)
allbestcolumns = list(map(best6, points))  # list() needed in Python 3
bestcolumns will now contain the best 6 columns in ascending order of their column sums. By similar logic, allbestcolumns will contain the best six columns for each row.
Extending on unutbu's longer answer above, it's possible to generate the array of scores automatically. Since your scores for values are consistent on every pass through the loop, the scores for each value only need to be calculated once. Here's a slightly inelegant way to do it on an example 6x10 array, before and after your scores are applied.
>>> import numpy
>>> values = numpy.random.randint(6, size=(6, 10))
>>> values
array([[4, 5, 1, 2, 1, 4, 0, 1, 0, 4],
       [2, 5, 2, 2, 3, 1, 3, 5, 3, 1],
       [3, 3, 5, 4, 2, 1, 4, 0, 0, 1],
       [2, 4, 0, 0, 4, 1, 4, 0, 1, 0],
       [0, 4, 1, 2, 0, 3, 3, 5, 0, 1],
       [2, 3, 3, 4, 0, 1, 1, 1, 3, 2]])
>>> b = values.copy()
>>> b[b < 3] = -1
>>> b[b == 3] = 0
>>> b[b > 3] = 1
>>> b
array([[ 1,  1, -1, -1, -1,  1, -1, -1, -1,  1],
       [-1,  1, -1, -1,  0, -1,  0,  1,  0, -1],
       [ 0,  0,  1,  1, -1, -1,  1, -1, -1, -1],
       [-1,  1, -1, -1,  1, -1,  1, -1, -1, -1],
       [-1,  1, -1, -1, -1,  0,  0,  1, -1, -1],
       [-1,  0,  0,  1, -1, -1, -1, -1,  0, -1]])
Incidentally, this thread claims that creating the combinations directly within numpy will yield around 5x faster performance than itertools, though perhaps at the expense of some readability.
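On that note, since the scoring is additive over columns, the score of a sextuple is simply the sum of its per-column totals, so all 924 column combinations can be scored in one vectorized step. Here is a sketch combining the two answers above (colsums, combos and scores are illustrative names):

import itertools

import numpy as np

A = np.random.randint(1, 6, size=(1000, 12))
x = np.array([0, -1, -1, 0, 1, 1])

colsums = x[A].sum(axis=0)                  # per-column score totals, shape (12,)
combos = np.array(list(itertools.combinations(range(12), 6)))  # shape (924, 6)
scores = colsums[combos].sum(axis=1)        # score of every sextuple at once
best_sextuple = tuple(combos[np.argmax(scores)])

In fact, additivity means the best sextuple is just the six columns with the highest totals, which is exactly what the argsort answer exploits.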
