numpy, fill sparse matrix with rows from other matrix - python

I have trouble figuring out what would be the most efficient way to do the following:
import numpy as np
M = 10
K = 10
ind = np.array([0,1,0,1,0,0,0,1,0,0])
full = np.random.rand(sum(ind),K)
output = np.zeros((M,K))
output[1,:] = full[0,:]
output[3,:] = full[1,:]
output[7,:] = full[2,:]
I want to build output, which is a sparse matrix, whose rows are given in a dense matrix (full) and the row indices are specified through a binary vector.
Ideally, I want to avoid a for-loop. Is that possible? If not, I'm looking for the most efficient way to for-loop this.
I need to perform this operation quite a few times. ind and full will keep changing, hence I've just provided some exemplar values for illustration.
I expect ind to be pretty sparse (at most 10% ones), and both M and K to be large numbers (10e2 - 10e3). Ultimately, I might need to perform this operation in pytorch, but a decent procedure for numpy would already get me quite far.
Please also help me find a more appropriate title for the question, or more appropriate categories, if you have any.
Many thanks,
Max

output[ind.astype(bool)] = full
By converting the integer values in ind to boolean values, you can do boolean indexing to select the rows in output that you want to populate with values in full.
example with a 4x4 array:
M = 4
K = 4
ind = np.array([0,1,0,1])
full = np.random.rand(sum(ind),K)
output = np.zeros((M,K))
output[ind.astype(bool)] = full
print(output)
[[ 0. 0. 0. 0. ]
[ 0.32434109 0.11970721 0.57156261 0.35839647]
[ 0. 0. 0. 0. ]
[ 0.66038644 0.00725318 0.68902177 0.77145089]]
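As a minimal sketch of the same idea, the rows can also be addressed by integer position instead of a boolean mask. The small sizes and the deterministic values in full are assumptions chosen only so the result is easy to check; both variants avoid a Python-level loop.

```python
import numpy as np

M, K = 6, 4
ind = np.array([0, 1, 0, 1, 0, 0])

# Deterministic stand-in for np.random.rand(ind.sum(), K)
full = np.arange(ind.sum() * K, dtype=float).reshape(-1, K) + 1.0

# Variant 1: boolean mask, as in the answer above
out_mask = np.zeros((M, K))
out_mask[ind.astype(bool)] = full

# Variant 2: integer row positions of the nonzero entries of ind
rows = np.flatnonzero(ind)
out_rows = np.zeros((M, K))
out_rows[rows] = full

print(np.array_equal(out_mask, out_rows))
```

Computing rows once can be convenient if the same positions are needed again later, for example to read the rows back with output[rows].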

Related

why a[:,[x]] could create a column vector from an array?

Why does a[:,[x]] create a column vector from an array? What does the [ ] represent?
Could anyone explain to me the principle?
import numpy as np
import torch
a = np.random.randn(5,6)
a = a.astype(np.float32)
print(a)
c = torch.from_numpy(a[:,[1]])
print(c)
[[-1.6919796 0.3160475 0.7606999 0.16881375 1.325092 0.71536326]
[ 1.217861 0.35804042 0.0285245 0.7097111 -2.1760604 0.992101 ]
[-1.6351479 0.6607222 0.9375339 0.5308735 -1.9699149 -2.002803 ]
[-1.1895325 1.1744579 -0.5980689 -0.8906375 -0.00494479 0.51751447]
[-1.7642071 0.4681248 1.3938268 -0.7519176 0.5987852 -0.5138923 ]]
###########################################
tensor([[0.3160],
[0.3580],
[0.6607],
[1.1745],
[0.4681]])
The [ ] means you are keeping an extra dimension: indexing with a list preserves the axis instead of dropping it. Check the shape attribute to see the difference.
a[:,1].shape
output :
(5,)
with [ ]
a[:,[1]].shape
output :
(5, 1)
That syntax is for array slicing in numpy, where arrays are indexed as a[rows, columns, page, ... (higher-dimensions)]
Selecting for a specific row/column/page is done by giving a specific number or range of numbers. So when you use a[1,2], numpy gets the element from row 1, column 2.
You can select several specific indices by giving the dimension multiple values. So a[[1,3],1] gets you both elements (1,1) and (3,1).
The : tells numpy to get everything from that specific array dimension. So when you use a[:,1], numpy gets every row in column 1. Alternatively, a[1,:] gets every column in row 1.
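The indexing rules described above can be checked with a short sketch. The array here is a deterministic stand-in (an assumption for illustration) for the random 5x6 array in the question, so the selected values are predictable.

```python
import numpy as np

# Stand-in for the random 5x6 float32 array from the question
a = np.arange(30, dtype=np.float32).reshape(5, 6)

col_1d = a[:, 1]        # integer index drops the column axis
col_2d = a[:, [1]]      # list index keeps the axis, with length 1
col_slice = a[:, 1:2]   # a length-1 slice also keeps the axis

print(col_1d.shape)     # (5,)
print(col_2d.shape)     # (5, 1)
print(col_slice.shape)  # (5, 1)

picked = a[[1, 3], 1]   # rows 1 and 3 of column 1
print(picked)           # [ 7. 19.]
```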

Efficient operation on numpy arrays contain rows with different size

I want to ask something related to a question posted some time ago, Operation on numpy arrays contain rows with different size. The point is that I need to do some operations on numpy arrays whose rows have different sizes.
The standard way, like "list2*list3*np.exp(list1)", doesn't work since the rows have different sizes; the option that works is using zip. See the code below.
import numpy as np
import time
# Ragged rows need dtype=object; recent NumPy versions refuse to infer it
list1 = np.array([[2.,0.,3.5,3],[3.,4.2,5.,7.1,5.]], dtype=object)
list2 = np.array([[2,3,3,0],[3,8,5.1,7.6,1.2]], dtype=object)
list3 = np.array([[1,3,8,3],[3,4,9,0,0]], dtype=object)
start_time = time.time()
c = []
for i in range(len(list1)):
    # element-wise y*z*exp(x) over row i of each array
    c.append([y*z*np.exp(x) for x, y, z in zip(list1[i], list2[i], list3[i])])
print("--- %s seconds ---" % (time.time()-start_time))
I want to ask if there is a much more efficient way to perform these operations, avoiding a loop and doing it in a more numpy way. Thanks!
This should do it:
f = np.vectorize(lambda x, y, z: y * z * np.exp(x))
result = [f(*i) for i in np.column_stack((list1, list2, list3))]
result
#[array([ 14.7781122 , 9. , 794.77084701, 0. ]),
# array([ 180.76983231, 2133.96259331, 6812.16400281, 0. , 0. ])]
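Note that np.vectorize is essentially a convenience wrapper around a Python-level loop. A different technique, sketched here under the assumption that the three inputs always share the same row lengths, is to flatten the ragged rows, do one vectorized computation, and split the result back. The names rows1, rows2, rows3 and lengths are illustrative and not from the original post.

```python
import numpy as np

rows1 = [[2., 0., 3.5, 3.], [3., 4.2, 5., 7.1, 5.]]
rows2 = [[2, 3, 3, 0], [3, 8, 5.1, 7.6, 1.2]]
rows3 = [[1, 3, 8, 3], [3, 4, 9, 0, 0]]

# Remember where each row ends so the flat result can be split again
lengths = [len(r) for r in rows1]

flat1 = np.concatenate(rows1)
flat2 = np.concatenate(rows2)
flat3 = np.concatenate(rows3)

# One vectorized pass over all elements
flat_result = flat2 * flat3 * np.exp(flat1)

# Cut the flat array back into the original ragged rows
result = np.split(flat_result, np.cumsum(lengths)[:-1])
print(result)
```

Whether this is faster in practice depends on the number and length of the rows, so it is worth timing on real data.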

Python: Combining 2D arrays with 1 common column that has different values

I want to combine two arrays which represent a curve where the variable is column 1, however the column 0 values do not always match:
import numpy as np
arr1= np.array([(12,1003),(17,900),(20,810)])
arr2= np.array([(10,1020),(17,902),(19,870),(21,750)])
I want to combine these into one array where the column 0 is combined and both column 1s are stacked with gaps where there is no value for the corresponding column 0 value, something like this:
arr3=np.array([((10,None,1020),(12,1003,None),(17,900,902),(19,None,870),(20,810,None),(21,None,750))])
The reason for this is that I want to be able to get mean values of the second column for each array but they are not at exactly the same column 0 value so the idea of creating this array is to then interpolate to replace all the None values, then create mean values from column 1 and 2 and have an extra column to represent that.
I have used NumPy for everything else so far but have obviously got stuck with the np.column_stack function, as it needs lists of the same length and is also blind to stacking based on values from column 0. Lastly, I do not want to create a fit for the data, as the actual data is non-linear and possibly not consistent, so a fit will not work and interpolation seems like the most accurate method.
There may be an answer already but due to me not knowing how to describe it well I can't find it. Also I am relatively new to python so please don't make any assumptions about my knowledge other than it is very little.
Thank you.
Will this help?
import pandas
import numpy as np
arr1= np.array([(12,1003),(17,900),(20,810)])
arr2= np.array([(10,1020),(17,902),(19,870),(21,750)])
d1 = pandas.DataFrame(arr1)
d2 = pandas.DataFrame(arr2)
d1.columns = d2.columns = ['t','v']
d3 = pandas.DataFrame(np.array(d1.merge(d2, on='t',how='outer')))
print(d3.values)
# use d3.to_numpy() to convert to a numpy array (as_matrix() was removed from pandas)
output
[[ 12. 1003. nan]
[ 17. 900. 902.]
[ 20. 810. nan]
[ 10. nan 1020.]
[ 19. nan 870.]
[ 21. nan 750.]]
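The question goes on to interpolate the gaps and average the two curves. A NumPy-only sketch of that follow-up step is given here, assuming linear interpolation is acceptable and that points outside a curve's own range should stay NaN rather than be extrapolated. The variable names x, y1, y2 and mean are illustrative.

```python
import numpy as np

arr1 = np.array([(12, 1003), (17, 900), (20, 810)])
arr2 = np.array([(10, 1020), (17, 902), (19, 870), (21, 750)])

# Sorted union of the column-0 values from both curves
x = np.union1d(arr1[:, 0], arr2[:, 0])

# Linear interpolation onto the common grid; NaN outside each curve's range
y1 = np.interp(x, arr1[:, 0], arr1[:, 1], left=np.nan, right=np.nan)
y2 = np.interp(x, arr2[:, 0], arr2[:, 1], left=np.nan, right=np.nan)

# Mean of the two curves, ignoring NaN where only one curve is defined
mean = np.nanmean([y1, y2], axis=0)

arr3 = np.column_stack((x, y1, y2, mean))
print(arr3)
```

np.interp requires the x values of each input curve to be increasing, which holds for the sample data.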

Python - split matrix data into separate columns

I have read data from a file and stored into a matrix (frag_coords):
frag_coords =
[[ 916.0907976 -91.01391344 120.83596334]
[ 916.01117655 -88.73389753 146.912555 ]
[ 924.22832597 -90.51682575 120.81734705]
...
[ 972.55384732 708.71316138 52.24644577]
[ 972.49089559 710.51583744 72.86369124]]
type(frag_coords) =
<class 'numpy.matrixlib.defmatrix.matrix'>
I do not have any issues when reordering the matrix by a specified column. For example, the code below works just fine:
order = np.argsort(frag_coords[:,2], axis=0)
My issue is that:
len(frag_coords[0]) = 1
I need to access the numbers of the first row individually. I've tried splitting it and transforming it into a list, but everything seems to return the 3 numbers not as columns but rather as a single element with len=1. I need help please!
Your problem is that you're using a matrix instead of an ndarray. Are you sure you want that?
For a matrix, indexing the first row alone leads to another matrix, a row matrix. Check frag_coords[0].shape: it will be (1,3). For an ndarray, it would be (3,).
If you only need to index the first row, use two indices:
frag_coords[0,j]
Or if you store the row temporarily, just index into it as a row matrix:
tmpvar = frag_coords[0] # shape (1,3)
print(tmpvar[0,2]) # for column 2 of row 0
If you don't need too many matrix operations, I'd advise that you use np.arrays instead. You can always read your data into an array directly, but at a given point you can just transform an existing matrix with np.array(frag_coords) too if you wish.
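The difference between the two types can be seen in a short sketch. The two rows of sample values are rounded from the question's data and serve only as an illustration; frag_matrix and frag_array are hypothetical names.

```python
import numpy as np

frag_matrix = np.matrix([[916.09, -91.01, 120.84],
                         [916.01, -88.73, 146.91]])

# Indexing a matrix row gives another 2-D matrix
row_m = frag_matrix[0]
print(row_m.shape, len(row_m))   # (1, 3) 1

# Converting to an ndarray makes row indexing return a 1-D array
frag_array = np.asarray(frag_matrix)
row_a = frag_array[0]
print(row_a.shape, len(row_a))   # (3,) 3

# The three numbers can now be unpacked individually
x, y, z = row_a
print(x, y, z)
```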

Composite a numpy array/matrix based on column values and variables?

I'm playing with NumPy and Scipy and I'm having trouble finding a feature in the documentation. I was thus wondering if anyone could help.
Suppose I have an array in NumPy with two columns and k rows. One column serves as a numerical indicator (e.g. 2 = male, 1 = female, 0 = unknown) while the second column is perhaps a list of values or scores.
Let's say that I want to find the standard deviation (could be mean or whatever, I just want to apply a function) of the values for all rows with indicator 0, and then for 1, and finally, 2.
Is there a predefined function to composite this for me?
In R, the equivalent can be found in the plyr package. Does NumPy and/or Scipy have an equivalent, or am I stuck creating a mask for this array and then somehow filtering through this mask and then applying my function?
As always, thanks for your help!
If I understand your description, you have a dataset something like this:
In [79]: x=np.random.randint(0,3,size=100)
In [80]: y=np.random.randint(0,100,size=100)
In [81]: d=np.vstack([x,y]).T
In [88]: print(d[:5,:])
[[ 0 43]
[ 1 60]
[ 2 60]
[ 1 4]
[ 0 30]]
In this situation numpy.unique can be used to generate an array of unique "key" values:
In [82]: idx=np.unique(d[:,0])
In [83]: print(idx)
[0 1 2]
and those values used to drive a generator expression like this:
In [113]: g=(d[np.where(d[:,0]==val),1].std() for val in idx)
The generator g will emit the standard deviation of all the entries in d which match each entry in the index. numpy.fromiter can then be used to collect the results:
In [114]: print(np.vstack([idx,np.fromiter(g,dtype=float)]).T)
[[ 0. 26.87376385]
[ 1. 29.41046084]
[ 2. 24.2477246 ]]
Note that the keys are converted to floating point in the last step during stacking. You might not want that depending on your data, but I just did it for illustrative purposes to have a "nice" looking final result to post.
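The same grouping can be sketched with boolean masks and np.unique's return_inverse option. The small fixed dataset is an assumption made so the results can be checked by hand. For sums and means specifically, np.bincount avoids the per-key loop altogether.

```python
import numpy as np

# Column 0: indicator, column 1: value (hand-checkable sample data)
d = np.array([[0, 10], [1, 20], [0, 30], [2, 5], [1, 40], [2, 5]])

# keys: sorted unique indicators; inverse: position of each row's key in keys
keys, inverse = np.unique(d[:, 0], return_inverse=True)

# Any function can be applied per group through a boolean mask
stds = np.array([d[inverse == k, 1].std() for k in range(len(keys))])

# Sums and means can be done without a loop via bincount
sums = np.bincount(inverse, weights=d[:, 1])
counts = np.bincount(inverse)
means = sums / counts

print(keys)
print(stds)
print(means)
```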
You can use masked array operations for that.
http://docs.scipy.org/doc/numpy/reference/maskedarray.html#maskedarray
To create the mask, you can use the numpy.where function, like so:
male_mask = numpy.where(a[:,0]==2, False, True)
female_mask = numpy.where(a[:,0]==1, False, True)
Then, remember to use the special functions from numpy.ma:
http://docs.scipy.org/doc/numpy/reference/routines.ma.html
male_average = numpy.ma.average(numpy.ma.array(a[:,1], mask=male_mask))
EDIT: actually, this works just as well:
numpy.ma.average(numpy.ma.array(a[:,1], mask=a[:,0]!=value))
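A self-contained sketch of the masked-array approach follows. The sample array a and the helper group_average are hypothetical and exist only for illustration. The key point is that mask=True hides an entry, so the mask must mark the rows that do not belong to the group.

```python
import numpy as np

# Column 0: indicator (2 = male, 1 = female, 0 = unknown), column 1: score
a = np.array([[2, 10.0], [1, 20.0], [2, 30.0], [0, 5.0], [1, 40.0]])

def group_average(data, value):
    """Average of column 1 over rows whose indicator equals value."""
    # Mask (hide) every row that is NOT in the requested group
    masked = np.ma.array(data[:, 1], mask=data[:, 0] != value)
    return masked.mean()

male_average = group_average(a, 2)    # mean of 10 and 30
female_average = group_average(a, 1)  # mean of 20 and 40
print(male_average, female_average)
```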
