Numpy Array summing with weights - python

I have a two dimensional numpy array.
Each row has three elements, each an integer from 0 to 3. A row represents a 6-bit integer, with each cell holding two bits, in order.
I'm trying to transform them into the full integer.
for i in range(len(myarray)):
    myarray[i] = myarray[i][0] * 16 + myarray[i][1] * 4 + myarray[i][2]
I.e., I'm trying to sum each row according to a weight vector of [16, 4, 1].
What is the most elegant way to do this? I'm thinking I have to do some sort of dot product followed by a sum, but I'm not 100% confident where to do the dot.

Your inclination to use a dot product is correct, and the dot product already includes the sum you need. So, to get the sum of the products of the elements of a target array and a set of weights:
>>> a = np.array([[0,1,2],[2,2,3]])
>>> a
array([[0, 1, 2],
       [2, 2, 3]])
>>> weights = np.array([16,4,1])
>>> np.dot(a, weights)
array([ 6, 43])
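As a small sketch of the same idea, the place values for base-4 digits can be generated instead of typed out, and the @ operator gives the same result as np.dot. The names digits and values are illustrative and do not come from the question.

```python
import numpy as np

# Each row holds three base-4 digits, most significant first.
digits = np.array([[0, 1, 2],
                   [2, 2, 3]])

# Place values for three base-4 digits: 4**2, 4**1, 4**0.
weights = 4 ** np.arange(digits.shape[1] - 1, -1, -1)

# Matrix-vector product: multiplies each row by the weights and sums.
values = digits @ weights
print(weights)  # [16  4  1]
print(values)   # [ 6 43]
```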


Get column indices of row-wise maximum values of a 2D array (with random tie-breaking)

Given a 2D numpy array, I want to construct an array out of the column indices of the maximum value of each row. So far, arr.argmax(1) works well. However, in my specific case, 2 or more columns of a row may contain the maximum value. In that case, I want to select a column index randomly (not the first index, as is the case with .argmax(1)).
For example, for the following arr:
arr = np.array([
    [0, 1, 0],
    [1, 1, 0],
    [2, 1, 3],
    [3, 2, 2]
])
there can be two possible outcomes: array([1, 0, 2, 0]) and array([1, 1, 2, 0]) each chosen with 1/2 probability.
I have code that returns the expected output using a list comprehension:
idx = np.arange(arr.shape[1])
ans = [np.random.choice(idx[ix]) for ix in arr == arr.max(1, keepdims=True)]
but I'm looking for an optimized numpy solution. In other words, how do I replace the list comprehension with numpy methods to make the code feasible for bigger arrays?
Use scipy.stats.rankdata and apply_along_axis as follows.
import numpy as np
from scipy.stats import rankdata
ranks = rankdata(-arr, axis = 1, method = "min")
func = lambda x: np.random.choice(np.where(x==1)[0])
idx = np.apply_along_axis(func, 1, ranks)
It returns [1 0 2 0] or [1 1 2 0].
The main idea is that rankdata calculates the rank of every value in each row, so the maximum value gets rank 1 (with method="min", tied maxima all get rank 1). func randomly chooses one of the indices whose rank is 1. Finally, apply_along_axis applies func to every row of ranks.
After some advice I got offline, it turns out that a maximum can be picked at random by multiplying the boolean array that flags row-wise maximum values by a random array of the same shape. What remains is a simple argmax(1) call.
# boolean array that flags maximum values of each row
mxs = arr == arr.max(1, keepdims=True)
# random array where non-maximum values are zero and maximum values are random values
random_arr = np.random.rand(*arr.shape) * mxs
# row-wise maximum of the auxiliary array
ans = random_arr.argmax(1)
A timeit test shows that for data of shape (507_563, 12), this code runs in ~172 ms on my machine while the loop in the question runs for 11 sec, so this is about 63x faster.
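The masking trick can be wrapped in a small function and sanity-checked. This is only a sketch: the function name random_argmax and the seeded np.random.default_rng generator are assumptions made for reproducibility, and it uses np.where with a score of -1 for non-maxima (a slight variation on the multiplication above) so that a non-maximum can never win the argmax.

```python
import numpy as np

def random_argmax(arr, rng):
    """Column index of each row's maximum, ties broken uniformly at random."""
    # Flag every position that equals its row's maximum.
    is_max = arr == arr.max(axis=1, keepdims=True)
    # Random scores at the maxima, -1 elsewhere so non-maxima can never win.
    scores = np.where(is_max, rng.random(arr.shape), -1.0)
    return scores.argmax(axis=1)

arr = np.array([[0, 1, 0],
                [1, 1, 0],
                [2, 1, 3],
                [3, 2, 2]])
rng = np.random.default_rng(0)
idx = random_argmax(arr, rng)

# Every chosen column must hold its row's maximum.
rows = np.arange(arr.shape[0])
print(np.array_equal(arr[rows, idx], arr.max(axis=1)))  # True
```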

Vectorized relative complement of sets in numpy

I have an array A = np.arange(n) and a list B of non-overlapping subarrays of A: a division of the initial array into k arrays of consecutive numbers.
One example would be:
A = [0, 1, 2, 3, 4, 5, 6]
B = [[0, 1], [2, 3, 4], [5, 6]]
For every subarray C of B I have to calculate A\C (where \ is the set difference, so the result is a numpy array of all elements of A which are not in C).
My current solution hits time limit:
import numpy as np
def complements(A, B):  # function name assumed; the original wrapper was not shown
    ans = []
    for C in B:
        ans.append(np.setdiff1d(A, C))
    return ans
I'd like to speed it up by using vectorization, but I have no idea how. I've tried to remove the loop, leaving only functions like setxor1d and setdiff1d, but failed.
I assume A and the subarrays of B are sorted and have unique elements. My test example below uses 10**6 integers divided into 100 subarrays, generated by the following code.
A = np.sort(np.unique(np.random.randint(0,10**10,10**6)))
B = np.split(A, np.sort(np.random.randint(0,10**6-1,99)))
You can cut the time in half by setting assume_unique=True. You can cut that time by a further factor of 3 by only computing the set difference for the numbers in A that lie between the smallest and biggest number of the particular subarray of B. I realize that my example is the best case for this optimization, so I am not sure how much it will help on your real-world data. You will have to try.
boundaries = [x[i] for x in B for i in [0,-1]]
boundary_idx = np.searchsorted(A, boundaries).reshape(-1,2)
ans = [np.concatenate([
           A[:x[0]],
           np.setdiff1d(A[x[0]:x[1]+1], b, assume_unique=True),
           A[x[1]+1:]])
       for b,x in zip(B, boundary_idx)]
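When every subarray really is a run of consecutive elements of a sorted, unique A (as in the question), the complement is just the part of A before the run plus the part after it, so no set difference is needed at all. The following is a sketch under that assumption. The helper name complements_contiguous is hypothetical, and this is a simplification rather than the answer's exact code.

```python
import numpy as np

def complements_contiguous(A, B):
    """For each run C in B, return A without C.

    Assumes A is sorted and unique, and each C is a non-empty
    contiguous slice of A.
    """
    out = []
    for C in B:
        start = np.searchsorted(A, C[0])      # index of C's first element in A
        stop = np.searchsorted(A, C[-1]) + 1  # one past C's last element in A
        out.append(np.concatenate([A[:start], A[stop:]]))
    return out

A = np.arange(7)
B = [A[0:2], A[2:5], A[5:7]]
result = complements_contiguous(A, B)
for C, rest in zip(B, result):
    print(C, rest)
```

The loop over B remains, but each iteration only does two binary searches and one concatenation, so it may still be worth timing against the setdiff1d versions on real data.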

Optimize testing all combinations of rows from multiple NumPy arrays

I have three NumPy arrays of ints, same number of columns, arbitrary number of rows each. I am interested in all instances where a row of the first one plus a row of the second one gives a row of the third one ([3, 1, 4] + [1, 5, 9] = [4, 6, 13]).
Here is a pseudo-code:
for i, j in rows(array1), rows(array2):
    if i + j is in rows(array3):
        somehow store the rows this occurred at (e.g. (1,2,5) if 1st row of
        array1 + 2nd row of array2 give 5th row of array3)
I will need to run this for very big matrices so I have two questions:
(1) I can write the above using nested loops but is there a quicker way, perhaps list comprehensions or itertools?
(2) What is the fastest/most memory-efficient way to store the triples? Later I will need to create a heatmap using two as coordinates and the first one as the corresponding value eg. point (2,5) has value 1 in the pseudo-code example.
Would be very grateful for any tips - I know this sounds quite simple but it needs to run fast and I have very little experience with optimization.
edit: My ugly code was requested in comments
import numpy as np
#random arrays
A = np.array([[-1,0],[0,-1],[4,1], [-1,2]])
B = np.array([[1,2],[0,3],[3,1]])
C = np.array([[0,2],[2,3]])
#triples stored as numbers with 2 coordinates in a otherwise-zero matrix
output_matrix = np.zeros((B.shape[0], C.shape[0]), dtype = int)
for i in range(A.shape[0]):
    for j in range(B.shape[0]):
        for k in range(C.shape[0]):
            if np.array_equal((A[i,] + B[j,]), C[k,]):
                output_matrix[j, k] = i+1
We can leverage broadcasting to perform all those summations and comparison in a vectorized manner and then use np.where on it to get the indices corresponding to the matching ones and finally index and assign -
output_matrix = np.zeros((B.shape[0], C.shape[0]), dtype = int)
mask = ((A[:,None,None,:] + B[None,:,None,:]) == C).all(-1)
I,J,K = np.where(mask)
output_matrix[J,K] = I+1
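To see what the broadcasting does, here is the same computation on the arrays from the question with the intermediate shapes printed. This is only an illustration. Note that the 4-D intermediate holds |A| * |B| * |C| * k elements, so memory may become the limiting factor for very large inputs.

```python
import numpy as np

A = np.array([[-1, 0], [0, -1], [4, 1], [-1, 2]])
B = np.array([[1, 2], [0, 3], [3, 1]])
C = np.array([[0, 2], [2, 3]])

# sums[i, j, 0, :] is A[i] + B[j]; the length-1 axis broadcasts against C's rows.
sums = A[:, None, None, :] + B[None, :, None, :]
print(sums.shape)  # (4, 3, 1, 2)

# mask[i, j, k] is True when A[i] + B[j] equals C[k] in every column.
mask = (sums == C).all(-1)
print(mask.shape)  # (4, 3, 2)

I, J, K = np.where(mask)
output_matrix = np.zeros((B.shape[0], C.shape[0]), dtype=int)
output_matrix[J, K] = I + 1
print(output_matrix)
```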
(1) Improvements
You can use sets for the final result in the third matrix, as a + b = c must hold identically. This already replaces one nested loop with a constant-time lookup. I will show you an example of how to do this below, but we first ought to introduce some notation.
For a set-based approach to work, we need a hashable type. Lists will thus not work, but a tuple will: it is an ordered, immutable structure. There is, however, a problem: tuple addition is defined as appending, that is,
(0, 1) + (1, 0) = (0, 1, 1, 0).
This will not do for our use-case: we need element-wise addition. As such, we subclass the built-in tuple as follows,
class AdditionTuple(tuple):
    def __add__(self, other):
        """Element-wise addition."""
        if len(self) != len(other):
            raise ValueError("Undefined behaviour!")
        return AdditionTuple(self[idx] + other[idx]
                             for idx in range(len(self)))
Here we override the default behaviour of __add__. Now that we have a data-type amenable to our problem, let's prepare the data.
You give us,
A = [[-1, 0], [0, -1], [4, 1], [-1, 2]]
B = [[1, 2], [0, 3], [3, 1]]
C = [[0, 2], [2, 3]]
To work with. I say,
from types import SimpleNamespace
A = [AdditionTuple(item) for item in A]
B = [AdditionTuple(item) for item in B]
C = {tuple(item): SimpleNamespace(idx=idx, values=[])
for idx, item in enumerate(C)}
That is, we modify A and B to use our new data-type, and turn C into a dictionary which supports (amortised) O(1) look-up times.
We can now do the following, eliminating one loop altogether,
from itertools import product
for a, b in product(enumerate(A), enumerate(B)):
    idx_a, a_i = a
    idx_b, b_j = b
    if a_i + b_j in C:  # a_i + b_j == c_k, identically
        C[a_i + b_j].values.append((idx_a, idx_b))
{(2, 3): namespace(idx=1, values=[(3, 2)]), (0, 2): namespace(idx=0, values=[(0, 0), (1, 1)])}
Here, for each value in C, you get the index of that value (as idx), and a list of tuples of (idx_a, idx_b) whose elements of A and B together sum to the value at idx in C.
Let us briefly analyse the complexity of this algorithm. Redefining the lists A, B, and C as above is linear in the length of the lists. Iterating over A and B is of course in O(|A| * |B|), and the nested condition computes the element-wise addition of the tuples: this is linear in the length of the tuples themselves, which we shall denote k. The whole algorithm then runs in O(k * |A| * |B|).
This is a substantial improvement over your current O(k * |A| * |B| * |C|) algorithm.
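The same dictionary-lookup idea can be sketched more compactly with plain tuples, computing the element-wise sum explicitly instead of subclassing tuple. The function name match_rows and the returned list of (i, j, k) index triples are illustrative choices, not part of the answer above. If C contains duplicate rows, this version keeps only the last index for each.

```python
from itertools import product

def match_rows(A, B, C):
    """Return (i, j, k) for every A[i] + B[j] == C[k] (element-wise)."""
    # Map each row of C to its index for amortised O(1) lookups.
    index_of = {tuple(row): k for k, row in enumerate(C)}
    triples = []
    for (i, a), (j, b) in product(enumerate(A), enumerate(B)):
        total = tuple(x + y for x, y in zip(a, b))  # element-wise sum
        k = index_of.get(total)
        if k is not None:
            triples.append((i, j, k))
    return triples

A = [[-1, 0], [0, -1], [4, 1], [-1, 2]]
B = [[1, 2], [0, 3], [3, 1]]
C = [[0, 2], [2, 3]]
print(match_rows(A, B, C))  # [(0, 0, 0), (1, 1, 0), (3, 2, 1)]
```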
(2) Matrix plotting
Use a dok_matrix, a sparse SciPy matrix representation. Then you can use any heatmap-plotting library you like on the matrix, e.g. Seaborn's heatmap.

In Python. I have a list of ND arrays and I want to count duplicate arrays in order to calculate an Average for each Duplicate array value

I have a list of ND arrays (vectors); each vector has a (1,300) shape.
My goal is to find duplicate vectors inside the list, sum them, and divide the sum by the size of the list; the resulting vector then replaces each duplicate vector.
For example, if a is a list of ND arrays, a = [[2,3,1],[5,65,-1],[2,3,1]], then the first and the last elements are duplicates.
Their sum would be [4,6,2],
which will be divided by the size of the list of vectors, size = 3.
Output: a = [[4/3,6/3,2/3],[5,65,-1],[4/3,6/3,2/3]]
I have tried to use a Counter but it doesn't work for ndarrays.
What is the Numpy way?
If you have numpy 1.13 or higher, this is pretty simple:
def f(a):
    u, inv, c = np.unique(a, return_counts = True, return_inverse = True, axis = 0)
    p = np.where(c > 1, c / a.shape[0], 1)[:, None]
    return (u * p)[inv]
If you don't have 1.13, you'll need some trick to convert a into a 1-d array first. I recommend @Jaime's excellent answer using np.void.
How it works:
u is the unique rows of a (usually not in their original order)
c is the number of times each row of u is repeated in a
inv is the indices that map u back to a, i.e. u[inv] == a
p is the multiplier for each row of u based on your requirements: 1 if c == 1, and c / n (where n is the number of rows in a) if c > 1. [:, None] turns it into a column vector so that it broadcasts well with u
The return value is u * p, indexed back to the original row positions by [inv]
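Applied to the example from the question, the function behaves as follows. This sketch repeats f so that it runs on its own. The reshape(-1) on the inverse indices is a defensive addition of mine, meant to keep them 1-D regardless of the NumPy version in use.

```python
import numpy as np

def f(a):
    # Unique rows, the index of each original row within them, and their counts.
    u, inv, c = np.unique(a, return_inverse=True, return_counts=True, axis=0)
    inv = inv.reshape(-1)  # keep the inverse 1-D regardless of NumPy version
    # Duplicated rows are scaled by count / number_of_rows; unique rows by 1.
    p = np.where(c > 1, c / a.shape[0], 1)[:, None]
    return (u * p)[inv]

a = np.array([[2, 3, 1], [5, 65, -1], [2, 3, 1]])
result = f(a)
print(result)
```

The first and last rows come out as [4/3, 6/3, 2/3] and the middle row is unchanged, matching the output requested in the question.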
You can use np.unique with return_counts:
elements, count = np.unique(a, axis=0, return_counts=True)
return_counts makes np.unique also return the number of occurrences of each unique row.
The output looks like this:
(array([[ 2,  3,  1],
        [ 5, 65, -1]]), array([2, 1]))
Then you can multiply them like this:
(count * elements.T).T
Output:
array([[ 4,  6,  2],
       [ 5, 65, -1]])

Acquiring the Minimum array out of Multiple Arrays by order in Python

Say that I have 4 numpy arrays
In this case, I've determined [1,2,3] is the "minimum array" for my purposes, as it is one of two arrays with the lowest value at index 0, and of those two arrays it has the lowest value at index 1. If there were more arrays with similar values, I would need to compare the next index values, and so on.
How can I extract the array [1,2,3] in that same order from the pile?
How can I extend that to x arrays of size n?
Using the python non-numpy .sort() or sorted() on a list of lists (not numpy arrays) automatically does this e.g.
a = [[1,2,3],[2,3,1],[3,2,1],[1,3,2]]
The numpy sort seems to sort only within each subarray, so the simplest route seems to be converting to a Python list first. Assuming you have an array of arrays from which you want to pick the minimum, you could convert it with tolist(), sort the result, and take the first element.
As someone pointed out you could also do min(a.tolist()) which uses the same type of comparisons as sort, and would be faster for large arrays (linear vs n log n asymptotic run time).
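A minimal sketch of the list-based approach, assuming the arrays are stacked into one 2-D array (the values are taken from the list example above). Python compares lists lexicographically, which is the tie-breaking rule described in the question.

```python
import numpy as np

a = np.array([[1, 2, 3], [2, 3, 1], [3, 2, 1], [1, 3, 2]])

# Lists compare element by element, so sorting orders rows lexicographically.
smallest_by_sort = sorted(a.tolist())[0]

# min() uses the same comparison but needs only a single pass.
smallest_by_min = min(a.tolist())

print(smallest_by_sort)  # [1, 2, 3]
print(smallest_by_min)   # [1, 2, 3]
```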
Here's an idea using numpy:
import numpy
a = numpy.array([[1,2,3],[2,3,1],[3,2,1],[1,3,2]])
col = 0
# keep only the rows that are minimal in the current column
while a.shape[0] > 1 and col < a.shape[1]:
    a = a[a[:, col] == a[:, col].min()]
    col += 1
print(a)
This checks column by column until only one row is left.
numpy's lexsort is close to what you want. It sorts on the last key first, but that's easy to get around:
>>> a = np.array([[1,2,3],[2,3,1],[3,2,1],[1,3,2]])
>>> order = np.lexsort(a[:, ::-1].T)
>>> order
array([0, 3, 1, 2])
>>> a[order]
array([[1, 2, 3],
       [1, 3, 2],
       [2, 3, 1],
       [3, 2, 1]])
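If only the minimum row is needed, the first entry of the lexsort order is enough. A short sketch using the same array:

```python
import numpy as np

a = np.array([[1, 2, 3], [2, 3, 1], [3, 2, 1], [1, 3, 2]])

# lexsort treats the LAST key as primary, so reverse the columns first
# to make column 0 the primary sort key.
order = np.lexsort(a[:, ::-1].T)

min_row = a[order[0]]
print(min_row)  # [1 2 3]
```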
