Replacing non-zero values in a matrix with the marginals - python

I am trying to do some math with my matrix; I can write it down but I am not sure how to code it. The idea is to get a column of row marginal values, then make a new matrix in which every non-zero entry of a row is replaced with that row's marginal, and after that divide each column sum of the new non-zero values by the number of non-zero values in that column to get the column marginals.
I can get to the row marginals but I can't seem to think of a way to repopulate.
Example of what I want:
import numpy as np
matrix = np.matrix([[1,3,0],[0,1,2],[1,0,4]])
matrix([[1, 3, 0],
        [0, 1, 2],
        [1, 0, 4]])
marginals = ((matrix != 0).sum(1) / matrix.sum(1))
matrix([[0.5       ],
        [0.66666667],
        [0.4       ]])
What I want done next is to fill a matrix based on the non-zero locations of the first:
matrix([[0.5, 0.5, 0],
        [0, 0.667, 0.667],
        [0.4, 0, 0.4]])
The final wanted result is each column sum of the new matrix divided by the number of non-zero occurrences in that column:
matrix([[(0.5+0.4)/2, (0.5+0.667)/2, (0.667+0.4)/2]])

To get the final matrix we can use matrix-multiplication for efficiency -
In [84]: mask = matrix!=0
In [100]: (mask.T*marginals).T/mask.sum(0)
Out[100]: matrix([[0.45 , 0.58333334, 0.53333334]])
Or simpler -
In [110]: (marginals.T*mask)/mask.sum(0)
Out[110]: matrix([[0.45 , 0.58333334, 0.53333334]])
If you need that intermediate filled output too, use np.multiply for broadcasted elementwise multiplication -
In [88]: np.multiply(mask,marginals)
Out[88]:
matrix([[0.5       , 0.5       , 0.        ],
        [0.        , 0.66666667, 0.66666667],
        [0.4       , 0.        , 0.4       ]])
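As a side note, np.matrix is no longer recommended in current NumPy; here is a minimal sketch of the same pipeline with a plain ndarray (same example data, using broadcasting instead of matrix multiplication):
import numpy as np

a = np.array([[1, 3, 0], [0, 1, 2], [1, 0, 4]], dtype=float)
mask = a != 0                                      # non-zero locations
marginals = mask.sum(axis=1) / a.sum(axis=1)       # row marginals, shape (3,)

filled = mask * marginals[:, None]                 # broadcast row marginals onto the non-zeros
col_means = filled.sum(axis=0) / mask.sum(axis=0)  # column sums / non-zero counts per column

print(filled)     # [[0.5 0.5 0.] [0. 0.667 0.667] [0.4 0. 0.4]]
print(col_means)  # [0.45       0.58333333 0.53333333]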

Related

Use numpy array to do conditional operations on another array

Let's say I have 2 arrays:
a = np.array([2, 2, 0, 0, 2, 1, 0, 0, 0, 0, 3, 0, 1, 0, 0, 2])
b = np.array([0, 0.5, 0.25, 0.9])
What I would like to do is take each value in array a, use it as an index into array b, and multiply the two.
So the first value in array a is 2. I want the value in array b at that index position to be multiplied by that value. So in array b, index position 2's value is 0.25, so multiply that value (2) in array a by 0.25.
I know it can be done with iteration, but I'm trying to figure out how it's done with elementwise operations.
Here's the iteration way that I've done:
result = np.array([])
for idx in a:
    result = np.append(result, (b[idx] * idx))
To get the result:
print(result)
[0.5 0.5 0. 0. 0.5 0.5 0. 0. 0. 0. 2.7 0. 0.5 0. 0. 0.5]
What's an elementwise equivalent?
Integer arrays can be used as indices in numpy. As a consequence, you can simply do something like this
b[a] * a
EDIT:
Just for completeness, your iterative solution triggers a new memory allocation every time append is called (see the 'returns' section of this page). Since you already know the shape of your output (i.e. a.shape), it's much better to allocate the output array in advance, e.g. result = np.empty(a.shape), and then fill it in the loop, as in the sketch below.
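A minimal sketch of that pre-allocation pattern, assuming the a and b arrays from the question:
import numpy as np

# np.empty allocates the output once, avoiding the repeated reallocation
# that np.append causes on every iteration.
result = np.empty(a.shape, dtype=float)
for i, idx in enumerate(a):
    result[i] = b[idx] * idx   # same per-element rule as the original loop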
So there are a few ways to do this, but if you want purely element-wise operations you could do the following:
Before getting the result, each element of b is first multiplied by its own index. So create another vector n.
n = np.arange(len(b)) * b
# In the example, n now equals [0. , 0.5, 0.5, 2.7]
# then the result is just n indexed by a
result = n[a]
# result = [0.5, 0.5, 0. , 0. , 0.5, 0.5, 0. , 0. , 0. , 0. , 2.7, 0. , 0.5, 0. , 0. , 0.5]
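Both vectorized forms give the same numbers as the loop; a quick check, assuming the a, b and n arrays defined above:
# Sanity check: the two vectorized answers agree with each other
# (and with the iterative result).
assert np.allclose(b[a] * a, n[a])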

Implications of manually setting scipy sparse matrix shape

I need to perform online training on a TF-IDF model. I found that scikit-learn's
TfidfVectorizer does not support training in an online fashion, so I'm implementing my own CountVectorizer to support online training and then using scikit-learn's TfidfTransformer to update tf-idf values after a pre-defined number of documents have entered the corpus.
I found here that you shouldn't be adding rows or columns to numpy arrays since all data would need to be copied so it is stored in contiguous blocks of memory.
But then I also found that in fact, using scipy sparse matrix you can manually change the matrix's shape.
NumPy's reshape docs say:
It is not always possible to change the shape of an array without copying the data. If you want an error to be raised when the data is copied, you should assign the new shape to the shape attribute of the array
Since the "reshaping" of the sparse matrix is being done by assigning a new shape, is it safe to say data is not being copied? What are the implications of doing so? Is it efficient?
Code example:
from scipy import sparse

matrix = sparse.random(5, 5, .2, 'csr')  # Create (5,5) sparse matrix
matrix._shape = (6, 6) # Change shape to (6, 6)
# Modify data on new empty row
I would also like to expand my question to ask about methods such as vstack that allow one to append arrays to one another (the same as adding a row). Is vstack copying the whole data so it gets stored as contiguous blocks of memory, as stated in my first link? What about hstack?
EDIT:
So, following this question I've implemented a method to alter the values of a row in a sparse matrix.
Now, mixing the idea of adding new empty rows with the idea of modifying existing values I've come up with the following:
from scipy import sparse
import numpy as np

matrix = sparse.random(5, 3, .2, 'csr')
matrix._shape = (6, 3)
# Update indptr to let it know we added a row with nothing in it.
matrix.indptr = np.hstack((matrix.indptr, matrix.indptr[-1]))
# New elements on data, indices format
new_elements = [1, 1]
elements_indices = [0, 2]
# Set elements for new empty row
set_row_csr_unbounded(matrix, 5, new_elements, elements_indices)
I ran the above code a few times during the same execution and got no error. But as soon as I try to add a new column (in which case there would be no need to change indptr) I get an error when I try to alter the values. Any lead on why this happens?
Well, since set_row_csr_unbounded uses numpy.r_ underneath, I assume I'm better off using a lil_matrix, even if the elements, once added, cannot be modified. Am I right?
I think lil_matrix would be better because I assume numpy.r_ copies the data.
In numpy, reshape means to change the shape in such a way that keeps the same number of elements. So the product of the shape terms can't change.
The simplest example is something like
np.arange(12).reshape(3,4)
The assignment method is:
x = np.arange(12)
x.shape = (3,4)
The reshape method (or np.reshape(...)) returns a new array. The shape assignment works in-place.
The note from the docs that you quote comes into play when doing something like
x = np.arange(12).reshape(3,4).T
x.reshape(3,4) # ok, but copy
x.shape = (3,4) # raises error
To better understand what's happening here, print the array at different stages, and look at how the original 0,1,2,... contiguity changes. (that's left as an exercise for the reader since it isn't central to the bigger question.)
There is a resize function and method, but it isn't used much, and its behavior with respect to views and copies is tricky.
np.concatenate (and variants like np.stack, np.vstack) make new arrays, and copy all the data from the inputs.
A list (and an object-dtype array) contains pointers to the elements (which may be arrays), and so doesn't require copying data.
Sparse matrices store their data (and row/col indices) in various attributes that differ among the formats. coo, csr and csc have 3 1d arrays. lil has 2 object arrays containing lists. dok is a dictionary subclass.
lil_matrix implements a reshape method. The other formats do not. As with np.reshape the product of the dimensions can't change.
In theory a sparse matrix could be 'embedded' in a larger matrix with minimal copying of data, since all the new values will be the default 0, and not occupy any space. But the details for that operation have not been worked out for any of the formats.
sparse.hstack and sparse.vstack (don't use the numpy versions on sparse matrices) work by combining the coo attributes of the inputs (via sparse.bmat). So yes, they make new arrays (data, row, col).
A minimal example of making a larger sparse matrix:
In [110]: M = sparse.random(5,5,.2,'coo')
In [111]: M
Out[111]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in COOrdinate format>
In [112]: M.A
Out[112]:
array([[0.        , 0.80957797, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.23618044, 0.        , 0.91625967, 0.8791744 ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.7928235 , 0.        ]])
In [113]: M1 = sparse.coo_matrix((M.data, (M.row, M.col)),shape=(7,5))
In [114]: M1
Out[114]:
<7x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in COOrdinate format>
In [115]: M1.A
Out[115]:
array([[0.        , 0.80957797, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.23618044, 0.        , 0.91625967, 0.8791744 ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.7928235 , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ]])
In [116]: id(M1.data)
Out[116]: 139883362735488
In [117]: id(M.data)
Out[117]: 139883362735488
M and M1 have the same data attribute (same array id). But most operations on these matrices will require a conversion to another format (such as csr for math, or lil for changing values), and will involve copying and modifying the attributes. So this connection between the two matrices will be broken.
When you make a sparse matrix with a function like coo_matrix, and don't provide a shape parameter, it deduces the shape from the provided coordinates. If you provide a shape it uses that. That shape has to be at least as large as the implied shape. With lil (and dok) you can profitably create an 'empty' matrix with a large shape, and then set values iteratively. You don't want to do that with csr. And you can't directly set coo values.
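A minimal sketch of that lil-style pattern (the shape and values here are just placeholders):
from scipy import sparse

# Allocate an 'empty' lil matrix with a generous shape up front...
L = sparse.lil_matrix((1000, 50))

# ...then set values as they arrive; per-element assignment is cheap in lil.
L[0, 3] = 1.5
L[0, 7] = 2.5
L[1, 10] = 4.0

# Convert once at the end, when the filling is done.
M_csr = L.tocsr()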
The canonical way of creating sparse matrices is to build the data, row, and col arrays or lists iteratively from various pieces - with list append/extend or array concatenates, and make a coo (or csr) format array from that. So you do all the 'growing' before even creating the matrix.
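And a sketch of the canonical build-then-construct route (the triples here stand in for whatever source the data actually comes from):
from scipy import sparse

data, row, col = [], [], []
for r, c, v in [(0, 1, 2.0), (3, 2, 5.0), (4, 0, 1.5)]:   # placeholder stream of entries
    row.append(r)
    col.append(c)
    data.append(v)

# All the 'growing' happened in plain lists; the sparse matrix is built once.
M = sparse.coo_matrix((data, (row, col)), shape=(6, 3)).tocsr()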
changing _shape
Make a matrix:
In [140]: M = (sparse.random(5,3,.4,'csr')*10).astype(int)
In [141]: M
Out[141]:
<5x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
In [142]: M.A
Out[142]:
array([[0, 6, 7],
       [0, 0, 6],
       [1, 0, 5],
       [0, 0, 0],
       [0, 6, 0]])
In [144]: M[1,0] = 10
... SparseEfficiencyWarning)
In [145]: M.A
Out[145]:
array([[ 0,  6,  7],
       [10,  0,  6],
       [ 1,  0,  5],
       [ 0,  0,  0],
       [ 0,  6,  0]])
Your new-shape method (make sure the dtype of indptr doesn't change):
In [146]: M._shape = (6,3)
In [147]: newptr = np.hstack((M.indptr,M.indptr[-1]))
In [148]: newptr
Out[148]: array([0, 2, 4, 6, 6, 7, 7], dtype=int32)
In [149]: M.indptr = newptr
In [150]: M
Out[150]:
<6x3 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
In [151]: M.A
Out[151]:
array([[ 0,  6,  7],
       [10,  0,  6],
       [ 1,  0,  5],
       [ 0,  0,  0],
       [ 0,  6,  0],
       [ 0,  0,  0]])
In [152]: M[5,2]=10
... SparseEfficiencyWarning)
In [153]: M.A
Out[153]:
array([[ 0,  6,  7],
       [10,  0,  6],
       [ 1,  0,  5],
       [ 0,  0,  0],
       [ 0,  6,  0],
       [ 0,  0, 10]])
Adding a column also seems to work:
In [154]: M._shape = (6,4)
In [155]: M
Out[155]:
<6x4 sparse matrix of type '<class 'numpy.int64'>'
with 8 stored elements in Compressed Sparse Row format>
In [156]: M.A
Out[156]:
array([[ 0,  6,  7,  0],
       [10,  0,  6,  0],
       [ 1,  0,  5,  0],
       [ 0,  0,  0,  0],
       [ 0,  6,  0,  0],
       [ 0,  0, 10,  0]])
In [157]: M[5,0]=10
.... SparseEfficiencyWarning)
In [158]: M[5,3]=10
.... SparseEfficiencyWarning)
In [159]: M
Out[159]:
<6x4 sparse matrix of type '<class 'numpy.int64'>'
with 10 stored elements in Compressed Sparse Row format>
In [160]: M.A
Out[160]:
array([[ 0,  6,  7,  0],
       [10,  0,  6,  0],
       [ 1,  0,  5,  0],
       [ 0,  0,  0,  0],
       [ 0,  6,  0,  0],
       [10,  0, 10, 10]])
attribute sharing
I can make a new matrix from an existing one:
In [108]: M = (sparse.random(5,3,.4,'csr')*10).astype(int)
In [109]: newptr = np.hstack((M.indptr,6))
In [110]: M1 = sparse.csr_matrix((M.data, M.indices, newptr), shape=(6,3))
The data attributes are shared, at least in a view sense:
In [113]: M[0,1]=14
In [114]: M1[0,1]
Out[114]: 14
But if I modify M1 by adding a nonzero value:
In [117]: M1[5,0]=10
...
SparseEfficiencyWarning)
The link between the matrices breaks:
In [120]: M[0,1]=3
In [121]: M1[0,1]
Out[121]: 14
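If you'd rather not compare ids, np.shares_memory gives the same information; a small check assuming the M and M1 built in the session above:
import numpy as np

# True while the data attribute is still the very same array...
np.shares_memory(M.data, M1.data)
# ...and False after an assignment such as M1[5,0] = 10 forces scipy to
# rebuild M1's data/indices/indptr arrays.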

New array from an existing one, with two columns being the row/column indexes from the existing array and a third being the values [duplicate]

This question already has answers here:
Generalise slicing operation in a NumPy array
(4 answers)
Closed 5 years ago.
Here is some code I'm struggling with.
My goal is to create an array (db) from an existing one (t); each line of db will represent one value of t. db will have 3 columns: one for the row index in t, one for the column index in t, and one for the value in t.
In my case, t was a distance matrix, so the diagonal was 0 and it was symmetric; I replaced the lower triangular values with 0. I don't need the 0 values in the new array, but I can just delete them in another step.
import numpy as np
t = np.array([[0, 2.5],
              [0, 0]])
My goal is to obtain a new array such as:
db = np.array([[0, 0, 0],
               [0, 1, 2.5],
               [1, 0, 0],
               [1, 1, 0]])
Thanks for your time.
You can create a meshgrid of 2D coordinates for the rows and columns, then unroll these into 1D arrays. You can then concatenate these two arrays as well as the unrolled version of t into one final matrix:
import numpy as np
(Y, X) = np.meshgrid(np.arange(t.shape[1]), np.arange(t.shape[0]))
db = np.column_stack((X.ravel(), Y.ravel(), t.ravel()))
Example run
In [9]: import numpy as np
In [10]: t = np.array([[0, 2.5],
    ...:               [0, 0]])
In [11]: (Y, X) = np.meshgrid(np.arange(t.shape[1]), np.arange(t.shape[0]))
In [12]: db = np.column_stack((X.ravel(), Y.ravel(), t.ravel()))
In [13]: db
Out[13]:
array([[ 0. ,  0. ,  0. ],
       [ 0. ,  1. ,  2.5],
       [ 1. ,  0. ,  0. ],
       [ 1. ,  1. ,  0. ]])
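An equivalent sketch using np.indices, which returns both coordinate grids in one call (same t as above):
import numpy as np

rows, cols = np.indices(t.shape)   # row-index grid and column-index grid
db = np.column_stack((rows.ravel(), cols.ravel(), t.ravel()))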

Keep the n highest values of each row of an numpy array and zero everything else [duplicate]

This question already has answers here:
numpy matrix, setting 0 to values by sorting each row
(2 answers)
Closed 5 years ago.
I have a numpy array of data where I need to keep only the n highest values per row, and zero everything else.
My current solution:
import numpy as np
np.random.seed(30)
# keep only the n highest values
n = 3
# Simple 2x5 data field for this example, real life application will be extremely large
data = np.random.random((2,5))
#[[ 0.64414354 0.38074849 0.66304791 0.16365073 0.96260781]
# [ 0.34666184 0.99175099 0.2350579 0.58569427 0.4066901 ]]
# find indices of the n highest values per row
idx = np.argsort(data)[:,-n:]
#[[0 2 4]
# [4 3 1]]
# put those values back in a blank array
data_ = np.zeros(data.shape) # blank slate
for i in range(data.shape[0]):
    data_[i, idx[i]] = data[i, idx[i]]
# Each row contains only the 3 highest values of the original data
#[[ 0.64414354 0. 0.66304791 0. 0.96260781]
# [ 0. 0.99175099 0. 0.58569427 0.4066901 ]]
In the code above, data_ has the n highest values and everything else is zeroed out. This works out nicely even if data.shape[1] is smaller than n. But the only issue is the for loop, which is slow because my actual use case is on very very large arrays.
Is it possible to get rid of the for loop?
You could act on the result of np.argsort -- applying np.argsort twice, the first time to get the index order and the second to get the ranks -- in a vectorized fashion, and then use either np.where or simple multiplication to zero everything else:
In [116]: np.argsort(data)
Out[116]:
array([[3, 1, 0, 2, 4],
       [2, 0, 4, 3, 1]])
In [117]: np.argsort(np.argsort(data)) # these are the ranks
Out[117]:
array([[2, 1, 3, 0, 4],
       [1, 4, 0, 3, 2]])
In [118]: np.argsort(np.argsort(data)) >= data.shape[1] - 3
Out[118]:
array([[ True, False,  True, False,  True],
       [False,  True, False,  True,  True]], dtype=bool)
In [119]: data * (np.argsort(np.argsort(data)) >= data.shape[1] - 3)
Out[119]:
array([[ 0.64414354,  0.        ,  0.66304791,  0.        ,  0.96260781],
       [ 0.        ,  0.99175099,  0.        ,  0.58569427,  0.4066901 ]])
In [120]: np.where(np.argsort(np.argsort(data)) >= data.shape[1]-3, data, 0)
Out[120]:
array([[ 0.64414354,  0.        ,  0.66304791,  0.        ,  0.96260781],
       [ 0.        ,  0.99175099,  0.        ,  0.58569427,  0.4066901 ]])
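If the double argsort is still too slow on very large arrays, np.argpartition may help: it only does a partial sort to find the n largest entries per row. A sketch reusing the data and n from above:
# Indices of the n largest values in each row (order within those n is arbitrary).
idx = np.argpartition(data, -n, axis=1)[:, -n:]

# Scatter just those entries into a zeroed copy.
out = np.zeros_like(data)
rows = np.arange(data.shape[0])[:, None]
out[rows, idx] = data[rows, idx]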

Performing grouped average and standard deviation with NumPy arrays

I have a set of data (X, Y). My independent variable values X are not unique, so there are multiple repeated values. I want to output a new array containing: X_unique, a list of the unique values of X; Y_mean, the mean of all the Y values corresponding to each X_unique; and Y_std, the standard deviation of all the Y values corresponding to each X_unique.
x = data[:,0]
y = data[:,1]
You can use binned_statistic from scipy.stats, which supports various statistic functions applied in chunks across a 1D array. To get the chunks, we need to sort and get the positions of the shifts (where the chunks change), for which np.unique would be useful. Putting all of that together, here's an implementation -
import numpy as np
from scipy.stats import binned_statistic as bstat
# Sort data corresponding to argsort of first column
sdata = data[data[:,0].argsort()]
# Unique col-1 elements and positions of breaks (elements are not identical)
unq_x,breaks = np.unique(sdata[:,0],return_index=True)
breaks = np.append(breaks,data.shape[0])
# Use binned statistic to get grouped average and std deviation values
idx_range = np.arange(data.shape[0])
avg_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='mean', bins=breaks)
std_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='std', bins=breaks)
From the docs of binned_statistic, one can also use a custom statistic function :
function : a user-defined function which takes a 1D array of values,
and outputs a single numerical statistic. This function will be called
on the values in each bin. Empty bins will be represented by
function([]), or NaN if this returns an error.
Sample input, output -
In [121]: data
Out[121]:
array([[2, 5],
       [2, 2],
       [1, 5],
       [3, 8],
       [0, 8],
       [6, 7],
       [8, 1],
       [2, 5],
       [6, 8],
       [1, 8]])
In [122]: np.column_stack((unq_x,avg_y,std_y))
Out[122]:
array([[ 0.        ,  8.        ,  0.        ],
       [ 1.        ,  6.5       ,  1.5       ],
       [ 2.        ,  4.        ,  1.41421356],
       [ 3.        ,  8.        ,  0.        ],
       [ 6.        ,  7.5       ,  0.5       ],
       [ 8.        ,  1.        ,  0.        ]])
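As the quoted docs suggest, statistic can also be any callable that maps a 1D array to a single number; for example, a per-group median (assuming the sdata, idx_range and breaks defined above):
# Example of the custom-statistic hook: np.median applied per bin.
med_y, _, _ = bstat(x=idx_range, values=sdata[:, 1], statistic=np.median, bins=breaks)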
x_unique = np.unique(x)
y_means = np.array([np.mean(y[x==u]) for u in x_unique])
y_stds = np.array([np.std(y[x==u]) for u in x_unique])
Pandas is made for such a task:
data=np.random.randint(1,5,20).reshape(10,2)
import pandas
pandas.DataFrame(data).groupby(0).mean()
gives
          1
0
1  2.666667
2  3.000000
3  2.000000
4  1.500000
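To get the standard deviation as well, groupby + agg does both in one pass (note pandas' std defaults to ddof=1, while np.std defaults to ddof=0); a sketch assuming the x and y arrays from the question:
import pandas as pd

df = pd.DataFrame({'x': x, 'y': y})
stats = df.groupby('x')['y'].agg(['mean', 'std']).reset_index()
# columns: x, mean, std -- one row per unique x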
