Masking a few non-zero elements of certain rows of a matrix - python

I have a 3*3 matrix of 1s and 0s, A = [[1,0,1],[0,1,1],[1,0,0]], and an array indicating the limit on each row sum, B = [1,2,1]. I want to find the rows of A whose sum exceeds the corresponding value in B, and set enough non-zero elements of A to zero so that each row sum matches B. Finding the rows of A that exceed the limit is easy; masking the elements to adjust the sum is what I need help with. How can this be achieved (I want to scale it to larger matrices and tensors)?

I would do something like this:
import numpy as np
A = np.array([[1,0,1],[0,1,1],[1,0,0]])
B = np.array([1,2,1])
# a cumulative sum of each row will tell you how many
# ones were in that row up to each point.
A_cs = np.cumsum(A, axis = 1)
# thresholding according to the sum vector will let
# you know where you should start omitting values since
# at that point the sum of the row exceeds its limit.
A_th = A_cs > B[:, None]
# then you can use the boolean array to create a new
# array where the appropriate values in the original
# array are set to zero to reduce the row sum.
A_nw = A * (1 - A_th)
output:
A_nw =
[[1 0 0]
[0 1 1]
[1 0 0]]
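If you prefer, the whole thing condenses to one expression (just a compressed form of the steps above):
import numpy as np
A = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0]])
B = np.array([1, 2, 1])
# keep an element only while the running count of ones in its row
# has not yet exceeded that row's limit
A_nw = A * (np.cumsum(A, axis=1) <= B[:, None])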
Unrelated note:
The following note is here to help OP better their dev-related search skills.
I can answer some questions instantaneously, but this was not one of them. I'm telling you this because I reached the answer through a simple google search for "python find the i th non zero element in each row", which led me to this post, which in turn led me very quickly to an answer. You don't have to try to be a better, more independent code writer. But if you want to, know that you can.

Related

Pandas random n samples of consecutive rows / pairs

I have a pandas dataframe indexed by ID and sorted by value. I want to create a sample of size n=20000, with 40000 rows in total, where 2 rows are consecutive/paired. I want to perform additional calculations on these 2 consecutive / paired rows,
e.g. if I say sample size n=2, I want to randomly pick pairs and find the difference in distance within each of the following picks.
Additional condition: value difference can't exceed 4000.
index value distance
cg13869341 15865 1.635450
cg14008030 18827 4.161332
Then the distance of the following, etc.
cg20826792 29425 0.657369
cg33045430 29407 1.708055
Sample original dataframe
index value distance
cg13869341 15865 1.635450
cg14008030 18827 4.161332
cg12045430 29407 0.708055
cg20826792 29425 0.657369
cg33045430 69407 1.708055
cg40826792 59425 0.857369
cg47454306 88407 0.708055
cg60826792 96425 2.857369
I tried using df_sample = df.sample(n=20000). Then I got a bit lost trying to figure out how to get the next row for each value in df_sample.
The original shape is (480136, 14).
If it doesn't matter to always have (even, odd) position pairs (which reduces randomness a bit), you can select n rows at even positions and take the row after each one:
N = 20000
# get the indices of N random ODD rows
idx = df.loc[::2].sample(n=N).index
# create a boolean mask to identify the rows
m = df.index.to_series().isin(idx)
# select those OR the next ones
df_sample = df.loc[m|m.shift()]
Example output on the toy DataFrame (N=3):
index value distance
2 cg12045430 29407 0.708055
3 cg20826792 29425 0.657369
4 cg33045430 69407 1.708055
5 cg40826792 59425 0.857369
6 cg47454306 88407 0.708055
7 cg60826792 96425 2.857369
increasing randomness
The drawback of the above approach is the bias towards always having (even, odd) pairs. To overcome this, we can first remove a random fraction of the DataFrame: small enough to still leave enough rows to pick from, but large enough to shift many of the (even, odd) pairs to (odd, even) ones. The fraction of rows to remove should be tuned depending on the initial size and the sample size; I used 20-30% here:
N = 20000
frac = 0.2
idx = (df
.drop(df.sample(frac=frac).index)
.loc[::2].sample(n=N)
.index
)
m = df.index.to_series().isin(idx)
df_sample = df.loc[m|m.shift()]
# check:
# len(df_sample)
# 40000
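The question's additional condition (the value difference within a pair can't exceed 4000) is not handled above; one possible post-filter on df_sample (a sketch, assuming the pairs are the consecutive row couples selected above):
import numpy as np
# label the rows of df_sample pair by pair: 0, 0, 1, 1, 2, 2, ...
pair_id = np.arange(len(df_sample)) // 2
# within-pair spread of 'value'
value_diff = df_sample.groupby(pair_id)['value'].transform(lambda s: s.max() - s.min())
# drop both rows of any pair whose values differ by more than 4000
df_sample = df_sample[value_diff <= 4000]
Note this drops whole pairs, so the final sample can end up smaller than 2*N.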
Here's my first attempt (I only just noticed your additional constraint, and I'm not sure if you need the precise number of samples; in that case you'll have to do some fudging after the masking step below).
import random
import numpy as np
# Temporarily reset index so we can have something that we can add one to.
df = df.reset_index(level=0)
# Choose the first index of each pair.
# Use random.sample if you don't want repeats,
# or random.choice if you don't mind them.
# The code below does allow overlapping pairs such as (1,2) and (2,3).
first_indices = np.array(random.sample(sorted(df.index[:-1]), 4))
# Keep only those indices where the diff with the next row down
# does not exceed the 4000 limit.
mask = [abs(df.loc[i, "value"] - df.loc[i + 1, "value"]) <= 4000 for i in first_indices]
first_indices = first_indices[mask]
# Interleave this array with the same numbers, plus 1.
c = np.empty((first_indices.size * 2,), dtype=first_indices.dtype)
c[0::2] = first_indices
c[1::2] = first_indices + 1
# Filter
df_sample = df[df.index.isin(c)]
# Restore original index if required.
df = df.set_index("index")
Hope that helps. Regarding the bit where I use a mask to filter first_indices, this answer might be of help if you need faster alternatives: Filtering (reducing) a NumPy Array
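If you do need an exact number of pairs despite the filtering, a simple approach (just a sketch, continuing from the reset-index df above and using an arbitrary oversampling factor of 3) is to sample more candidate indices than needed, filter, and truncate:
import random
import numpy as np
n_pairs = 4
# sample more candidates than needed, drop the ones violating the 4000 limit,
# then keep the first n_pairs survivors
candidates = np.array(random.sample(sorted(df.index[:-1]), n_pairs * 3))
mask = np.array([abs(df.loc[i, "value"] - df.loc[i + 1, "value"]) <= 4000 for i in candidates])
first_indices = candidates[mask][:n_pairs]
With a small oversampling factor there is still a chance of coming up short, so it is worth checking len(first_indices) afterwards.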

Creating a matrix with at most one value in every row and column

From a matrix filled with values (see picture), I want to obtain a matrix with at most one value for every row and column. If there is more than one value, the maximum should be kept and the other set to 0. I know I can do that with np.max and np.argmax, but I'm wondering if there is some clever way to do it that I'm not aware of.
Here's the solution I have for now:
tmp = np.zeros_like(matrix)
for x in np.argmax(matrix, axis=0): # get max on x axis
    for y in np.argmax(matrix, axis=1): # get max on y axis
        tmp[x][y] = matrix[x][y]
matrix = tmp
The sparse structure may be used for efficiency; however, right now I see a contradiction between "at most one value for every row and column" and your current implementation, which may leave more than one value per row/column.
Either you need an order of preference (e.g. rows over columns), or you need to work through an absolute sorting of all matrix values.
One idea that is guaranteed to produce at most one entry per row and column is to first select the maxima of the rows, and then select from this intermediate matrix the maxima of the columns:
import numpy as np
rows=5
cols=5
matrix=np.random.rand(rows, cols)
rowmax=np.argmax(matrix, 1)
rowmax_matrix=np.zeros((rows, cols))
for ri, rm in enumerate(rowmax):
    rowmax_matrix[ri,rm]=matrix[ri, rm]
colrowmax=np.argmax(rowmax_matrix, 0)
colrowmax_matrix=np.zeros((rows, cols))
for ci, cm in enumerate(colrowmax):
    colrowmax_matrix[cm, ci]=rowmax_matrix[cm, ci]
This is probably not the final answer, but may help to formulate the desired algorithm precisely.
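For reference, a loop-free variant of the same two-step idea (a sketch; exact ties would keep more than one entry, which is unlikely with random floats):
import numpy as np
rows, cols = 5, 5
matrix = np.random.rand(rows, cols)
# keep only each row's maximum, zero out the rest
rowmax_matrix = np.where(matrix == matrix.max(axis=1, keepdims=True), matrix, 0)
# from that intermediate matrix, keep only each column's maximum
colrowmax_matrix = np.where(rowmax_matrix == rowmax_matrix.max(axis=0, keepdims=True), rowmax_matrix, 0)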

Integer matrix to stochastic matrix normalization

Suppose I have a matrix with integer values. I want to make it a stochastic matrix (i.e. the sum of each row in the matrix equals 1).
I create a random matrix, compute the sum of each row and divide each element in a row by the row sum.
dt = pd.DataFrame(np.random.randint(0,10000,size=10000).reshape(100,100))
dt['sum_row'] = dt.sum(axis=1)
for col_n in dt.columns[:-1]:
    dt[col_n] = dt[col_n] / dt['sum_row']
After this, the sum of each row should be equal to 1. But it is not.
(dt.sum_row_normalized == 1).value_counts()
> False 75
> True 25
> Name: sum_row_normalized, dtype: int64
I understand that some values are not exactly 1 but very close to it. Nevertheless, how can I normalize the matrix correctly?
You can't guarantee the floats will be exactly one, but you can check that they are close to one, up to an arbitrary precision, with np.around.
This is probably easier/faster without looping through pandas columns.
X = np.random.randint(0,10000,size=10000).reshape(100,100)
X_float = X.astype(float)
Y = X_float/X_float.sum(axis=1)[:,np.newaxis]
sum(np.around(Y.sum(axis=1),decimals=10)==1) # is 100
(you don't need the .astype(float) step in python 3.x)
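As a side note (not part of the original answer), np.isclose / np.allclose express the same tolerance check a bit more directly:
import numpy as np
X = np.random.randint(0, 10000, size=10000).reshape(100, 100).astype(float)
Y = X / X.sum(axis=1)[:, np.newaxis]
np.allclose(Y.sum(axis=1), 1)        # True: all row sums equal 1 within tolerance
np.isclose(Y.sum(axis=1), 1).sum()   # 100: number of rows whose sum is close to 1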

How to sum over columns with some weight in a csr matrix in python

If I have a large csr_matrix A, I want to sum over its columns, simply
A.sum(axis=0)
does this for me, right? Are the corresponding axis values: 1->rows, 0->columns?
I am stuck when I want to sum over the columns with some weights which are specified in a list, e.g. [1 2 3 4 5 4 3 ... 4 2 5], with the same length as the number of rows in the csr_matrix A. To be more clear, I want the inner product of each column vector with this weight vector. How can I achieve this with Python?
This is a part of my code:
uniFeature = csr_matrix(uniFeature)
[I,J] = uniFeature.shape
sumfreq = uniFeature.sum(axis=0)
sumratings = []
for j in range(J):
    column = uniFeature.getcol(j)
    column = column.toarray()
    sumtemp = np.dot(ratings,column)
    sumratings.append(sumtemp)
sumfreq = sumfreq.toarray()
average = np.true_divide(sumratings,sumfreq)
(Numpy is imported as np.) There is a weight vector "ratings"; the program is supposed to output the average rating for each column of the matrix "uniFeature".
I experimented with dotting column = uniFeature.getcol(j) directly with ratings (which is a list), but I get an error saying the formats do not agree. It works after column.toarray() and then dotting with ratings. But doesn't converting each column back to dense form defeat the point of having the sparse matrix, and isn't it very slow? I ran the above code and it is too slow to show the results. I guess there should be a way to dot the vector "ratings" with each column of the sparse matrix efficiently.
Thanks in advance!
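The inner products described above can be computed in one go as a sparse matrix-vector product, with no loop over columns; a minimal sketch with small stand-in data for uniFeature and ratings:
import numpy as np
from scipy.sparse import csr_matrix
# small stand-ins for the question's uniFeature and ratings
uniFeature = csr_matrix(np.array([[1, 0, 2],
                                  [0, 3, 0],
                                  [4, 0, 5]]))
ratings = np.array([1.0, 2.0, 3.0])  # one weight per row
# inner product of the weight vector with every column at once
sumratings = uniFeature.T.dot(ratings)                # shape (3,)
sumfreq = np.asarray(uniFeature.sum(axis=0)).ravel()  # plain column sums
average = np.true_divide(sumratings, sumfreq)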

Numpy signed maximum magnitude of cumsum along an axis

I have a numpy array a, a.shape=(17,90,144). I want to find the maximum magnitude of each column of cumsum(a, axis=0), but retaining the original sign. In other words, if for a given column a[:,j,i] the largest magnitude of cumsum corresponds to a negative value, I want to retain the minus sign.
The code np.amax(np.abs(a.cumsum(axis=0))) gets me the magnitude, but doesn't retain the sign. Using np.argmax instead will get me the indices I need, which I can then plug into the original cumsum array. But I can't find a good way to do so.
The following code works, but is dirty and really slow:
max_mag_signed = np.zeros((90,144))
indices = np.argmax(np.abs(a.cumsum(axis=0)), axis=0)
for j in range(90):
    for i in range(144):
        max_mag_signed[j,i] = a.cumsum(axis=0)[indices[j,i],j,i]
There must be a cleaner, faster way to do this. Any ideas?
I can't find any alternative to argmax, but at least you can speed it up with a more vectorized approach:
# store the cumsum, since it's used multiple times
cum_a = a.cumsum(axis=0)
# find the indices as before
indices = np.argmax(abs(cum_a), axis=0)
# construct the indices for the second and third dimensions
y, z = np.indices(indices.shape)
# get the values with np indexing
max_mag_signed = cum_a[indices, y, z]
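For newer NumPy (1.15+), np.take_along_axis expresses the same pick-by-index step without building the index grids by hand; a small self-contained sketch with random data:
import numpy as np
a = np.random.randn(17, 90, 144)
cum_a = a.cumsum(axis=0)
indices = np.argmax(np.abs(cum_a), axis=0)
# pick, for each (j, i), the cumsum value at the argmax position along axis 0
max_mag_signed = np.take_along_axis(cum_a, indices[None, :, :], axis=0)[0]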