appending indices to numpy array - python

I make a variable corr_matrix by iterating over rows and columns and correlating values.
import numpy as np
import random
enc_dict = {k: int(random.uniform(1,24)) for k in range(24)}
ret_dict = {k: int(random.uniform(1,24)) for k in range(24)}
corr_matrix=np.zeros((24,24))
ind_matrix = np.zeros((24,24))
data = np.random.rand(24,24)
for enc_row in range(0, 24):
    for ret_col in range(0, 24):
        corr_matrix[enc_row, ret_col] = np.corrcoef(data[enc_row, :], data[ret_col, :])[0, 1]
        if enc_dict[enc_row] == ret_dict[ret_col]:
            ind_matrix = np.append(ind_matrix, [[enc_row, ret_col]])
I want to store the indices in the matrix where enc_dict[enc_row] == ret_dict[ret_col] as a variable to use for indexing corr_matrix. I can print the values, but I can't figure out how to store them in a variable in a way that allows me to use them for indexing later.
I want to:
1. Make a variable, ind_matrix, that holds the indices where the above statement is true.
2. Use ind_matrix to index within my correlation matrix. I want to be able to index the whole row as well as the exact value where the above statement is true (enc_dict[enc_row] == ret_dict[ret_col]).
I tried ind_matrix = np.append(ind_matrix, [[enc_row, ret_col]]), which gives me the correct values, but it has a lot of 0s before the numbers for some reason. Also it doesn't allow me to call each pair of points together to use for indexing. I want to be able to do something like corr_matrix[ind_matrix[1]].

Here is a modified version of your code containing a couple of suggestions and comments:
import numpy as np
# when indices are 0, 1, 2, ... don't use dictionary
# also for integer values use randint
enc_ = np.random.randint(1, 24, (24,))
ret_ = np.random.randint(1, 24, (24,))
data = np.random.rand(24,24)
# np.corrcoef is vectorized, no need to loop:
corr_matrix = np.corrcoef(data)
# the following is the clearest, but maybe not the fastest way of generating
# your index array:
ind_matrix = np.argwhere(np.equal.outer(enc_, ret_))
# this can't be used for indexing directly, you'll have to choose
# one of the following idioms
# EITHER spread to two index arrays
I, J = ind_matrix.T
# or directly I, J = np.where(np.equal.outer(enc_, ret_))
# single index
print(corr_matrix[I[1], J[1]])
# multiple indices
print(corr_matrix[I[[1,2,0]], J[[1,2,0]]])
# whole row
print(corr_matrix[I[1]])
# OR use tuple conversion
ind_matrix = np.array(ind_matrix)
# single index
print(corr_matrix[(*ind_matrix[1],)])
# multiple indices
print(corr_matrix[(*zip(*ind_matrix[[1,2,0]],),)])
# whole row
print(corr_matrix[ind_matrix[1, 0]])
# OR if you do not plan to use multiple indices
as_tuple = list(map(tuple, ind_matrix))
# single index
print(corr_matrix[as_tuple[1]])
# whole row
print(corr_matrix[as_tuple[1][0]])
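To make the np.equal.outer step more concrete, here is a tiny worked example (my own illustration, not part of the answer's code):
import numpy as np

enc_ = np.array([3, 7, 3])
ret_ = np.array([7, 3, 1])

# outer comparison: entry [i, j] is True where enc_[i] == ret_[j]
print(np.equal.outer(enc_, ret_))
# [[False  True False]
#  [ True False False]
#  [False  True False]]

# argwhere turns that boolean matrix into (row, col) index pairs
print(np.argwhere(np.equal.outer(enc_, ret_)))
# [[0 1]
#  [1 0]
#  [2 1]]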

Related

Removing non-triple rows en masse in numpy array

I have an array A that has 1 million rows and 3 columns. The last column contains integers that identify the data in the other two columns. I would like to keep only the rows whose identifier occurs exactly three times, and delete all rows whose identifier occurs any other number of times (once, twice, or four times, for example). Below is a function, remove_loose_ends, that I wrote to handle this. However, this function is being called many times and is the bottleneck of the entire program. Are there any possible enhancements that could remove the loop from this operation or decrease its runtime in other ways?
import numpy as np
import time

def remove_loose_ends(A):
    # get unique counts
    unique_id, unique_counter = np.unique(A[:, 2], return_counts=True)
    # initialize outgoing index mask
    good_index = np.array([[True] * (A.shape[0])])
    # loop through all indices and flip them to False if they match the non-triplet entries
    for i in range(0, len(unique_id)):
        if unique_counter[i] != 3:
            good_index = good_index ^ (A[:, 2] == unique_id[i])
    # return incoming array with mask applied
    return A[np.squeeze(good_index), :]

# example array A
A = np.random.rand(1000000, 3)
# making last column "unique" integers
A[:, 2] = (A[:, 2] * 1e6).astype(int)
# timing function call
start = time.time()
B = remove_loose_ends(A)
print(time.time() - start)
So, the main problem is that you basically loop over all the values twice, making it roughly an n² operation.
What you could do instead is to create an array of booleans directly from the output of the numpy.unique function and let it do the indexing for you.
For example, something like this:
import numpy as np
import time
def remove_loose_ends(A):
    # get unique counts
    _, unique_inverse, unique_counter = np.unique(A[:, 2], return_inverse=True, return_counts=True)
    # Obtain boolean array of which integers occurred 3 times
    idx = unique_counter == 3
    # Obtain boolean array of which rows have integers that occurred 3 times
    row_idx = idx[unique_inverse]
    # return incoming array with mask applied
    return A[row_idx, :]

# example array A
A = np.random.rand(1000000, 3)
# making last column "unique" integers
A[:, 2] = (A[:, 2] * 1e6).astype(int)
# timing function call
start = time.time()
B = remove_loose_ends(A)
print(time.time() - start)
I tried timing both versions.
I stopped the function you posted after 15 minutes, whereas the one I give takes around 0.15 s on my PC.
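To see why idx[unique_inverse] maps the per-unique-value result back onto the rows, here is a small illustration (my addition, not from the original answer):
import numpy as np

col = np.array([5, 7, 5, 9, 7, 5])
uniq, inv, counts = np.unique(col, return_inverse=True, return_counts=True)
print(uniq)    # [5 7 9]
print(counts)  # [3 2 1]
print(inv)     # [0 1 0 2 1 0] -> position of each original value inside uniq

# which unique values occur exactly 3 times
idx = counts == 3          # [ True False False]
# broadcast that back onto the original rows
print(idx[inv])            # [ True False  True False False  True]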

Appending numpy array of arrays

I am trying to append an array to another array, but it's appending them as if they were just one array. What I would like is to have each array appended at its own index (without having to use a list; I want to use np arrays), i.e.
temp = np.array([])
for i in my_items:
    m = get_item_ids(i.color)  # returns an array such as [1, 4, 20, 5, 3] (always same number of items, but different ids)
    temp = np.append(temp, m, axis=0)
On the second iteration, let's suppose I get [5, 4, 15, 3, 10]; then I would like to have temp as
array([[1, 4, 20, 5, 3],
       [5, 4, 15, 3, 10]])
But instead I keep getting [1, 4, 20, 5, 3, 5, 4, 15, 3, 10].
I am new to Python, but I am sure there is probably a way to concatenate like this with numpy without using lists?
You have to reshape m in order to have two dimensions, with
m.reshape(-1, 1)
thus adding the second dimension. Then you can concatenate along axis=1:
np.concatenate([temp, m.reshape(-1, 1)], axis=1)
(note that temp must also be two-dimensional, with the same number of rows, for this to work).
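For illustration, here is a minimal runnable sketch of that idea, appending as rows (reshape(1, -1) with axis=0) so the output has the orientation the asker describes; get_item_ids here is a hypothetical stand-in for the asker's function:
import numpy as np

def get_item_ids(color):
    # hypothetical stand-in: always returns 5 ids
    return np.random.randint(0, 21, size=5)

my_items = ["red", "green", "blue"]

temp = np.empty((0, 5), dtype=int)   # start with 0 rows of the known width
for color in my_items:
    m = get_item_ids(color)                                   # shape (5,)
    temp = np.concatenate([temp, m.reshape(1, -1)], axis=0)   # append as a new row

print(temp.shape)   # (3, 5): one row per iteration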
List append is much better - faster and easier to use correctly.
temp = []
for i in my_items:
    m = get_item_ids(i.color)  # returns an array such as [1, 4, 20, 5, 3] (always same number of items, but different ids)
    temp.append(m)
Look at the list to see what it created. Then make an array from that:
arr = np.array(temp)
# or np.vstack(temp)
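As a runnable illustration of the list-then-array pattern (again with a hypothetical get_item_ids stand-in):
import numpy as np

def get_item_ids(color):
    # hypothetical stand-in: always returns 5 ids
    return np.random.randint(0, 21, size=5)

my_items = ["red", "green", "blue"]

temp = []
for color in my_items:
    temp.append(get_item_ids(color))

arr = np.array(temp)     # or np.vstack(temp)
print(arr.shape)         # (3, 5): one row per item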

Numpy: Finding correspondencies in one array by uniques of other array, arbitrary length

I have a problem where I have two arrays, one with identifiers which can occur multiple times, let's just say
import numpy as np
ind = np.random.randint(0,10,(100,))
and another one which is the same length and contains some info, in this case boolean, for each of the elements identified by ind. They are sorted correspondingly.
dat = np.random.randint(0,2,(100,)).astype(np.bool8)
I'm looking for a (faster?) way to do the following: do an np.any() over the dat entries belonging to each element (as defined by ind), for all elements. The number of occurrences per element is, as in the example, random. What I'm doing now is
result = np.empty(len(np.unique(ind)), dtype=bool)
for i, uni in enumerate(np.unique(ind)):
    result[i] = np.any(dat[ind == uni])
Which is sort of slow. Any ideas?
Approach #1
Index ind with dat to select the ones required to be checked, get the binned counts with np.bincount and see which bins have one or more occurrences -
result = np.bincount(ind[dat])>0
If ind has negative numbers, offset it with the min value -
ar = ind[dat]
result = np.bincount(ar-ar.min())>0
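A tiny worked example of the bincount idea (values chosen by hand, my addition):
import numpy as np

ind = np.array([0, 1, 1, 2, 3, 3])
dat = np.array([False, True, False, False, True, True])

# keep only the identifiers whose dat entry is True, then count them per bin
print(np.bincount(ind[dat]))        # [0 1 0 2]
print(np.bincount(ind[dat]) > 0)    # [False  True False  True]
# minlength=ind.max() + 1 can be passed to keep a bin for every identifier,
# even if the largest one never passes the dat filter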
Approach #2
One more with np.unique -
unq = np.unique(ind[dat])
n = len(np.unique(ind))
result = np.zeros(n,dtype=bool)
result[unq] = 1
We can use pandas to get n :
import pandas as pd
n = pd.Series(ind).nunique()
Approach #3
One more with indexing -
ar = ind[dat]
result = np.zeros(ar.max()+1,dtype=bool)
result[ar] = 1
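For a quick sanity check (my addition, not part of the original answer), the bincount approach can be compared against the loop from the question:
import numpy as np

ind = np.random.randint(0, 10, (100,))
dat = np.random.randint(0, 2, (100,)).astype(bool)

# loop version from the question
uniq = np.unique(ind)
loop_result = np.array([np.any(dat[ind == u]) for u in uniq])

# Approach #1, padded so every identifier 0..ind.max() gets a bin
vec_result = np.bincount(ind[dat], minlength=ind.max() + 1) > 0

# compare only at the identifiers that actually occur
print(np.array_equal(loop_result, vec_result[uniq]))   # expected: True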

numpy unique always the same?

I'm using the following code snippet to get a list of unique arrays, but it reorders the list in a strange way. Is uniquecoords bound to be in the same order every time or is there any random factor?
coords = []
for c in coordiantes:
    coords.extend(c)
a = np.array(coords)
uniquecoords = np.unique(
    a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
).view(a.dtype).reshape(-1, a.shape[1])
According to the doc of numpy.unique(), the function "Returns the sorted unique elements of an array." So the order should always be the same.
If you want to keep the original order of first appearance, you can do
_, idx = np.unique(your_array_of_views, return_index=True)
uniquecoords = a[np.sort(idx)]
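A small worked example of that idea (my own illustration; it uses np.unique with axis=0, available in NumPy >= 1.13, instead of the void-view trick):
import numpy as np

a = np.array([[3, 1],
              [0, 2],
              [3, 1],
              [5, 5]])

# sorted unique rows (what the question's code effectively produces)
print(np.unique(a, axis=0))

# unique rows in their original order of first appearance
_, idx = np.unique(a, axis=0, return_index=True)
print(a[np.sort(idx)])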

Python & Numpy - create dynamic, arbitrary subsets of ndarray

I am looking for a general way to do this:
raw_data = np.array(somedata)
filterColumn1 = raw_data[:,1]
filterColumn2 = raw_data[:,3]
cartesian_product = itertools.product(np.unique(filterColumn1), np.unique(filterColumn2))
for val1, val2 in cartesian_product:
    fixed_mask = (filterColumn1 == val1) & (filterColumn2 == val2)
    subset = raw_data[fixed_mask]
I want to be able to use any amount of filterColumns. So what I want is this:
filterColumns = [filterColumn1, filterColumn2, ...]
uniqueValues = map(np.unique, filterColumns)
cartesian_product = itertools.product(*uniqueValues)
for combination in cartesian_product:
    variable_mask = ????
    subset = raw_data[variable_mask]
Is there a simple syntax to do what I want? Otherwise, should I try a different approach?
Edit: This seems to be working
cartesian_product = itertools.product(*uniqueValues)
for combination in cartesian_product:
    variable_mask = True
    for idx, fc in enumerate(filterColumns):
        variable_mask &= (fc == combination[idx])
    subset = raw_data[variable_mask]
You could use numpy.all and index broadcasting for this
filter_matrix = np.array(filterColumns)
combination_array = np.array(combination)
# broadcast each filter value over its corresponding row of filter_matrix
bool_matrix = filter_matrix == combination_array[:, np.newaxis]
subset = raw_data[np.all(bool_matrix, axis=0)]
There are, however, simpler ways of doing the same thing if your filters are within the matrix, notably through numpy argsort and numpy roll over an axis. First you roll the columns until your filter columns come first, then you sort on them and slice the array vertically to get the rest of the matrix.
In general, if a for loop can be avoided in Python, it is better to avoid it.
Update:
Here is the full code without a for loop:
import numpy as np
# select filtering indexes
filter_indexes = [1, 3]
# generate the test data
raw_data = np.random.randint(0, 4, size=(50,5))
# create a column that we would use for indexing
index_columns = raw_data[:, filter_indexes]
# sort the index columns by lexicographic order over all the indexing columns
argsorts = np.lexsort(index_columns.T)
# sort both the index and the data column
sorted_index = index_columns[argsorts, :]
sorted_data = raw_data[argsorts, :]
# in each indexing column, find if number in row and row-1 are identical
# then group to check if all numbers in corresponding positions in row and row-1 are identical
autocorrelation = np.all(sorted_index[1:, :] == sorted_index[:-1, :], axis=1)
# find out the breakpoints: these are the positions where row and row-1 are not identical
breakpoints = np.nonzero(np.logical_not(autocorrelation))[0]+1
# finally find the desired subsets
subsets = np.split(sorted_data, breakpoints)
An alternative implementation would be to transform the indexing matrix into a string matrix, join it row-wise into a single key column, get an argsort over that now unique indexing column and split as above.
For convenience, it might be more interesting to first roll the indexing columns to the beginning of the matrix, so that the sorting done above is clearer.
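The string-key variant mentioned above could look roughly like this (my sketch; joining the indexing columns into one string per row is just one way to collapse them into a single sortable key):
import numpy as np

raw_data = np.random.randint(0, 4, size=(50, 5))
index_columns = raw_data[:, [1, 3]]

# build one string key per row, e.g. "2|0"
keys = np.array(["|".join(map(str, row)) for row in index_columns])

# sort rows by key, then split wherever the key changes
order = np.argsort(keys)
sorted_keys = keys[order]
sorted_data = raw_data[order]
breakpoints = np.nonzero(sorted_keys[1:] != sorted_keys[:-1])[0] + 1
subsets = np.split(sorted_data, breakpoints)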
Something like this?
variable_mask = np.ones_like(filterColumns[0], dtype=bool)  # select all rows initially
for column, val in zip(filterColumns, combination):
    variable_mask &= (column == val)
subset = raw_data[variable_mask]
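Plugged into the loop from the question, that would look something like this (my sketch, using random demo data in place of somedata):
import itertools
import numpy as np

raw_data = np.random.randint(0, 4, size=(50, 5))
filterColumns = [raw_data[:, 1], raw_data[:, 3]]
uniqueValues = map(np.unique, filterColumns)

for combination in itertools.product(*uniqueValues):
    variable_mask = np.ones_like(filterColumns[0], dtype=bool)
    for column, val in zip(filterColumns, combination):
        variable_mask &= (column == val)
    subset = raw_data[variable_mask]
    # subset now holds the rows matching this particular combination of filter values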
