Python & Numpy - create dynamic, arbitrary subsets of ndarray - python

I am looking for a general way to do this:
raw_data = np.array(somedata)
filterColumn1 = raw_data[:,1]
filterColumn2 = raw_data[:,3]
cartesian_product = itertools.product(np.unique(filterColumn1), np.unique(filterColumn2))
for val1, val2 in cartesian_product:
    fixed_mask = (filterColumn1 == val1) & (filterColumn2 == val2)
    subset = raw_data[fixed_mask]
I want to be able to use any amount of filterColumns. So what I want is this:
filterColumns = [filterColumn1, filterColumn2, ...]
uniqueValues = map(np.unique, filterColumns)
cartesian_product = itertools.product(*uniqueValues)
for combination in cartesian_product:
    variable_mask = ????
    subset = raw_data[variable_mask]
Is there a simple syntax to do what I want? Otherwise, should I try a different approach?
Edit: This seems to be working
cartesian_product = itertools.product(*uniqueValues)
for combination in cartesian_product:
    variable_mask = True
    for idx, fc in enumerate(filterColumns):
        variable_mask &= (fc == combination[idx])
    subset = raw_data[variable_mask]

You could use numpy.all and index broadcasting for this:
filter_matrix = np.array(filterColumns)                # shape (n_filters, n_rows)
combination_array = np.array(combination)              # shape (n_filters,)
bool_matrix = filter_matrix == combination_array[:, np.newaxis]
row_mask = np.all(bool_matrix, axis=0)                 # rows where every filter matches
subset = raw_data[row_mask]
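Putting that together with the product loop from the question, a minimal runnable sketch (toy data; the filter columns 1 and 3 are the ones from the question):
import itertools
import numpy as np

raw_data = np.random.randint(0, 3, size=(20, 5))
filterColumns = [raw_data[:, 1], raw_data[:, 3]]
uniqueValues = [np.unique(fc) for fc in filterColumns]

filter_matrix = np.array(filterColumns)                # shape (n_filters, n_rows)
for combination in itertools.product(*uniqueValues):
    row_mask = np.all(filter_matrix == np.array(combination)[:, np.newaxis], axis=0)
    subset = raw_data[row_mask]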
There are, however, simpler ways of doing the same thing if your filters are within the matrix, notably through numpy argsort and numpy roll over an axis: first roll the axes until your filter columns come first, then sort on them and slice the array vertically to get the rest of the matrix.
In general, if a for loop can be avoided in Python, it is better to avoid it.
Update:
Here is the full code without a for loop:
import numpy as np
# select filtering indexes
filter_indexes = [1, 3]
# generate the test data
raw_data = np.random.randint(0, 4, size=(50,5))
# pull out the columns that we will use for indexing
index_columns = raw_data[:, filter_indexes]
# sort the index columns in lexicographic order over all the indexing columns
argsorts = np.lexsort(index_columns.T)
# sort both the index and the data column
sorted_index = index_columns[argsorts, :]
sorted_data = raw_data[argsorts, :]
# in each indexing column, find if number in row and row-1 are identical
# then group to check if all numbers in corresponding positions in row and row-1 are identical
autocorrelation = np.all(sorted_index[1:, :] == sorted_index[:-1, :], axis=1)
# find out the breakpoints: these are the positions where row and row-1 are not identical
breakpoints = np.nonzero(np.logical_not(autocorrelation))[0]+1
# finally find the desired subsets
subsets = np.split(sorted_data, breakpoints)
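If you also need to know which value combination each subset corresponds to, a small addition reusing sorted_index and breakpoints from above would be:
# the indexing values shared by every row of a subset, aligned with `subsets`
group_keys = [grp[0] for grp in np.split(sorted_index, breakpoints)]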
An alternative implementation would be to transform the indexing matrix into a string matrix, concatenate it row-wise into a single key column, take an argsort over that now-unique indexing column and split as above (a rough sketch follows below).
For convenience, it might also be worth rolling the indexing columns to the beginning of the matrix first, so that the sorting done above is easier to follow.
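A rough sketch of that string-key variant, under the same toy setup as above (the "_" separator is an arbitrary choice):
from functools import reduce
import numpy as np

raw_data = np.random.randint(0, 4, size=(50, 5))
index_columns = raw_data[:, [1, 3]]

# concatenate the indexing columns row-wise into a single string key per row
str_cols = [index_columns[:, i].astype(str) for i in range(index_columns.shape[1])]
keys = reduce(lambda a, b: np.char.add(np.char.add(a, "_"), b), str_cols)

order = np.argsort(keys)
sorted_keys = keys[order]
sorted_data = raw_data[order]

# split wherever the key changes, exactly as in the lexsort version
breakpoints = np.nonzero(sorted_keys[1:] != sorted_keys[:-1])[0] + 1
subsets = np.split(sorted_data, breakpoints)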

Something like this?
variable_mask = np.ones_like(filterColumns[0], dtype=bool) # boolean mask, select all rows initially
for column, val in zip(filterColumns, combination):
    variable_mask &= (column == val)
subset = raw_data[variable_mask]

Related

Mask python array based on multiple column indices

I have a 64*64 array, and would like to mask certain columns. For one column I know I can do:
mask = np.tri(64,k=0,dtype=bool)
col = np.zeros((64,64),bool)
col[:,3] = True
col_mask = col + np.transpose(col)
col_mask = np.tril(col_mask)
col_mask = col_mask[mask]
but how to extend this to multiple indices? I have tried doing col[:,1] & col[:,2] = True but got
SyntaxError: cannot assign to operator
Also I might have up to 10 columns I would like to mask, so is there a less unwieldily approach? I have also looked at numpy.indices but I don't think this is what I need. Thank you!
You can update multiple indices at the same time:
idx = [1,2,3,10,50]
col[:,idx] = True
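Applied to the setup from the question, a short sketch with several masked columns (the values in idx are arbitrary examples):
import numpy as np

mask = np.tri(64, k=0, dtype=bool)
col = np.zeros((64, 64), dtype=bool)

idx = [1, 2, 3, 10, 50]      # columns to mask, arbitrary example values
col[:, idx] = True

col_mask = col | col.T       # make it symmetric so the matching rows are masked too
col_mask = np.tril(col_mask)
col_mask = col_mask[mask]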

How can I compare two matrices row-wise in python?

I have two matrices with the same number of columns but a different number of rows, one is a lot larger.
matA = [[1,0,1],[0,0,0],[1,1,0]], matB = [[0,0,0],[1,0,1],[0,0,0],[1,1,1],[1,1,0]]
Both of them are numpy matrices
I am trying to find how many times each row of matA appears in matB and put that in an array, so the array in this case will become arr = [1,2,1], because the first row of matA appears once in matB, the second row appears twice and the last row only once.
Find unique rows in numpy.array
What is a faster way to get the location of unique rows in numpy
Here is a solution:
import numpy as np
A = np.array([[1,0,1],[0,0,0],[1,1,0]])
B = np.array([[0,0,0],[1,0,1],[0,0,0],[1,1,1],[1,1,0]])
# stack the rows, A has to be first
combined = np.concatenate((A, B), axis=0) #or np.vstack
unique, unique_indices, unique_counts = np.unique(combined,
                                                  return_index=True,
                                                  return_counts=True,
                                                  axis=0)
print(unique)
print(unique_indices)
print(unique_counts)
# now we need to derive your desired result from the unique
# indices and counts
# we know the number of rows in A
n_rows_in_A = A.shape[0]
# so we know that the indices from 0 to (n_rows_in_A - 1)
# in unique_indices are rows that appear first or only in A
indices_A = np.nonzero(unique_indices < n_rows_in_A)[0] #first
#indices_A1 = np.argwhere(unique_indices < n_rows_in_A)
print(indices_A)
#print(indices_A1)
unique_indices_A = unique_indices[indices_A]
unique_counts_A = unique_counts[indices_A]
print(unique_indices_A)
print(unique_counts_A)
# now we need to subtract one count from the unique_counts
# that's the one occurrence in A that we are not interested in.
unique_counts_A -= 1
print(unique_indices_A)
print(unique_counts_A)
# this is nearly the result we want
# now we need to sort it and account for rows that are not
# appearing in A but in B
# will do that later...
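Since the walkthrough above stops before the final result, here is a minimal broadcasting-based sketch that computes the requested counts directly for the example from the question (fine for small matrices; the intermediate comparison array has shape (rows_A, rows_B, n_cols)):
import numpy as np

matA = np.array([[1,0,1],[0,0,0],[1,1,0]])
matB = np.array([[0,0,0],[1,0,1],[0,0,0],[1,1,1],[1,1,0]])

# compare every row of matA with every row of matB, keep full-row matches,
# then count the matches along matB's axis
arr = (matA[:, None, :] == matB[None, :, :]).all(axis=2).sum(axis=1)
print(arr)   # [1 2 1]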

Numpy: Finding correspondences in one array by uniques of other array, arbitrary length

I have a problem where I have two arrays, one with identifiers which can occur multiple time, lets just say
import numpy as np
ind = np.random.randint(0,10,(100,))
and another one which is the same length and contains some info, in this case boolean, for each of the elements identified by ind. They are sorted correspondingly.
dat = np.random.randint(0,2,(100,)).astype(np.bool8)
I'm looking for a (faster?) way to do the following: do an np.any() over dat for each distinct element of ind. The number of occurrences per element is, as in the example, random. What I'm doing now is
result = np.empty(np.unique(ind).shape, dtype=bool)
for i, uni in enumerate(np.unique(ind)):
    result[i] = np.any(dat[ind==uni])
Which is sort of slow. Any ideas?
Approach #1
Index ind with dat to select the ones required to be checked, get the binned counts with np.bincount and see which bins have one or more occurrences -
result = np.bincount(ind[dat])>0
If ind has negative numbers, offset it with the min value -
ar = ind[dat]
result = np.bincount(ar-ar.min())>0
Approach #2
One more with np.unique -
unq = np.unique(ind[dat])
n = len(np.unique(ind))
result = np.zeros(n,dtype=bool)
result[unq] = 1
We can use pandas to get n :
import pandas as pd
n = pd.Series(ind).nunique()
Approach #3
One more with indexing -
ar = ind[dat]
result = np.zeros(ar.max()+1,dtype=bool)
result[ar] = 1
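As a quick sanity check, the bincount version can be compared against the loop from the question (this assumes ind holds non-negative integers, as in the example):
import numpy as np

ind = np.random.randint(0, 10, (100,))
dat = np.random.randint(0, 2, (100,)).astype(bool)

# loop-based reference from the question
uniques = np.unique(ind)
reference = np.array([np.any(dat[ind == u]) for u in uniques])

# Approach #1, padded so every possible value has a bin
fast = np.bincount(ind[dat], minlength=ind.max() + 1) > 0
print(np.array_equal(reference, fast[uniques]))   # True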

appending indices to numpy array

I make a variable corr_matrix by iterating over rows and columns and correlating values.
import numpy as np
import random
enc_dict = {k: int(random.uniform(1,24)) for k in range(24)}
ret_dict = {k: int(random.uniform(1,24)) for k in range(24)}
corr_matrix=np.zeros((24,24))
ind_matrix = np.zeros((24,24))
data = np.random.rand(24,24)
for enc_row in range(0,24):
    for ret_col in range(0,24):
        corr_matrix[enc_row, ret_col] = np.corrcoef(data[enc_row,:], data[ret_col,:])[0,1]
        if enc_dict[enc_row] == ret_dict[ret_col]:
            ind_matrix = np.append(ind_matrix, [[enc_row, ret_col]])
I want to store the indices in the matrix where enc_dict[enc_row] == ret_dict[ret_col] as a variable to use for indexing corr_matrix. I can print the values, but I can't figure out how to store them in a variable in a way that allows me to use them for indexing later.
I want to:
make a variable, ind_matrix that is the indices where the above statement is true.
I want to use ind_matrix to index within my correlation matrix. I want to be able to index the whole row as well as the exact value where the above statement is true (enc_dict[enc_row] == ret_dict[ret_col])
I tried ind_matrix = np.append(ind_matrix, [[enc_row, ret_col]]) which gives me the correct values but it has a lot of 0s before the #s for some reason. Also it doesn't allow me to call each pair of points together to use for indexing. I want to be able to do something like corr_matrix[ind_matrix[1]]
Here is a modified version of your code containing a couple of suggestions and comments:
import numpy as np
# when indices are 0, 1, 2, ... don't use dictionary
# also for integer values use randint
enc_ = np.random.randint(1, 24, (24,))
ret_ = np.random.randint(1, 24, (24,))
data = np.random.rand(24,24)
# np.corrcoef is vectorized, no need to loop:
corr_matrix = np.corrcoef(data)
# the following is the clearest, but maybe not the fastest way of generating
# your index array:
ind_matrix = np.argwhere(np.equal.outer(enc_, ret_))
# this can't be used for indexing directly, you'll have to choose
# one of the following idioms
# EITHER spread to two index arrays
I, J = ind_matrix.T
# or directly I, J = np.where(np.equal.outer(enc_, ret_))
# single index
print(corr_matrix[I[1], J[1]])
# multiple indices
print(corr_matrix[I[[1,2,0]], J[[1,2,0]]])
# whole row
print(corr_matrix[I[1]])
# OR use tuple conversion
ind_matrix = np.array(ind_matrix)
# single index
print(corr_matrix[(*ind_matrix[1],)])
# multiple indices
print(corr_matrix[(*zip(*ind_matrix[[1,2,0]],),)])
# whole row
print(corr_matrix[ind_matrix[1, 0]])
# OR if you do not plan to use multiple indices
as_tuple = list(map(tuple, ind_matrix))
# single index
print(corr_matrix[as_tuple[1]])
# whole row
print(corr_matrix[as_tuple[1][0]])

Complex Filtering of DataFrame

I've just started working with Pandas and I am trying to figure if it is the right tool for my problem.
I have a dataset:
date, sourceid, destid, h1..h12
I am basically interested in the sum of each of the H1..H12 columns, but, I need to exclude multiple ranges from the dataset.
Examples would be to:
exclude H4, H5, H6 data where sourceid = 4944, and exclude H8, H9-H12 where destination = 481981, and ...
... this can go on for many, many filters as we are constantly removing data to get close to our final model.
I think I saw in a solution that I could build a list of the filters I would want and then create a function to test against, but I haven't found a good example to work from.
My initial thought was to create a copy of the df and just remove the data we didn't want and if we need it back - we could just copy it back in from the origin df, but that seems like the wrong road.
By using masks, you don't have to remove data from the dataframe. E.g.:
mask1 = df.sourceid == 4944
var1 = df[mask1][['H4','H5','H6']].sum()
Or directly do:
var1 = df[df.sourceid == 4944][['H4','H5','H6']].sum()
In case of multiple filters, you can combine the Boolean masks with Boolean operators:
totmask = mask1 & mask2
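As a sketch of the "build a list of filters" idea mentioned in the question (the column names and id values come from the question; the frame itself is made up):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20, 12), columns=["H%d" % i for i in range(1, 13)])
df["sourceid"] = np.random.choice([4944, 1000], 20)
df["destid"] = np.random.choice([481981, 2000], 20)

# each filter: (row condition, columns whose values should be excluded)
filters = [
    (df.sourceid == 4944, ["H4", "H5", "H6"]),
    (df.destid == 481981, ["H8", "H9", "H10", "H11", "H12"]),
]

h_cols = ["H%d" % i for i in range(1, 13)]
sums = df[h_cols].sum()
for mask, cols in filters:
    # subtract the excluded portion instead of removing rows from df
    sums[cols] -= df.loc[mask, cols].sum()
print(sums)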
You can use DataFrame.loc[] to set the data to zeros.
Create a dummy DataFrame first:
N = 10000
df = pd.DataFrame(np.random.rand(N, 12),
                  columns=["h%d" % i for i in range(1, 13)],
                  index=["row%d" % i for i in range(1, N+1)])
df["sourceid"] = np.random.randint(0, 50, N)
df["destid"] = np.random.randint(0, 50, N)
Then for each of your filters you can call:
df.loc[df.sourceid == 10, "h4":"h6"] = 0
Since you have 600k rows, creating a mask array with df.sourceid == 10 may be slow. You can create Series objects that map value to the index of the DataFrame:
sourceid = pd.Series(df.index.values, index=df["sourceid"].values).sort_index()
destid = pd.Series(df.index.values, index=df["destid"].values).sort_index()
and then exclude h4,h5,h6 where sourceid == 10 by:
df.loc[sourceid[10], "h4":"h6"] = 0
to find row ids where sourceid == 10 and destid == 20:
np.intersect1d(sourceid[10].values, destid[20].values, assume_unique=True)
to find row ids where 10 <= sourceid <= 12 and 3 <= destid <= 5:
np.intersect1d(sourceid.loc[10:12].values, destid.loc[3:5].values, assume_unique=True)
sourceid and destid are Series with duplicated index values; when the index values are in order, pandas uses searchsorted to find the index. That is O(log N), faster than creating mask arrays, which is O(N).
