Python - split matrix data into separate columns

I have read data from a file and stored into a matrix (frag_coords):
frag_coords =
[[ 916.0907976 -91.01391344 120.83596334]
[ 916.01117655 -88.73389753 146.912555 ]
[ 924.22832597 -90.51682575 120.81734705]
...
[ 972.55384732 708.71316138 52.24644577]
[ 972.49089559 710.51583744 72.86369124]]
type(frag_coords) =
<class 'numpy.matrixlib.defmatrix.matrix'>
I do not have any issues when reordering the matrix by a specified column. For example, the code below works just fine:
order = np.argsort(frag_coords[:,2], axis=0)
My issue is that:
len(frag_coords[0]) = 1
I need to access the individual numbers of the first row. I've tried splitting it and transforming it into a list, but everything returns the three numbers not as separate columns but as a single element with len=1. I need help please!

Your problem is that you're using a matrix instead of an ndarray. Are you sure you want that?
For a matrix, indexing the first row alone leads to another matrix, a row matrix. Check frag_coords[0].shape: it will be (1,3). For an ndarray, it would be (3,).
If you only need to index the first row, use two indices:
frag_coords[0,j]
Or if you store the row temporarily, just index into it as a row matrix:
tmpvar = frag_coords[0] # shape (1,3)
print(tmpvar[0,2]) # for column 2 of row 0
If you don't need too many matrix operations, I'd advise using np.ndarray instead. You can always read your data into an array directly, but you can also convert an existing matrix at any point with np.array(frag_coords) if you wish.
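For illustration, a minimal sketch of the difference after converting (the 3x3 values here are placeholders standing in for the real file data):
import numpy as np
frag_coords = np.matrix([[916.09, -91.01, 120.84],
                         [916.01, -88.73, 146.91],
                         [924.23, -90.52, 120.82]])
arr = np.array(frag_coords)   # matrix -> ndarray
print(frag_coords[0].shape)   # (1, 3): a row matrix
print(arr[0].shape)           # (3,): a plain 1-d row
print(arr[0][2])              # chained indexing now works, len(arr[0]) == 3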

Related

How to find a specific column from a list of arrays

I have a list of multidimensional arrays and need to calculate the mean for each dimension. I want to extract column[1] data as a list and send it as a parameter to a method in python. Here is my data:
[array([2.33700000e+06, 4.16779479e-01, 9.31000000e-04, 1.99000000e-13, 0.00000000e+00, 0.00000000e+00]), array([2.33700000e+06, 4.16779479e-01, 9.31000000e-04, 1.99000000e-13, 0.00000000e+00, 0.00000000e+00])]
and I want to do some operations for column[1] data like doing an operation on [4.16779479e-01,4.16779479e-01]. How can I do it in python?
Your question looks like you are just trying to process the columns of a matrix. If it is an N by M matrix stored in a variable called my_mat, you could do something like
for i in range(len(my_mat[0])):
    col = [arr[i] for arr in my_mat]
    # process your column
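Alternatively, since the data here is a list of numpy arrays, stacking them may be simpler; a minimal sketch (the list name data and the mean call are assumptions for illustration):
import numpy as np
data = [np.array([2.337e+06, 4.16779479e-01, 9.31e-04]),
        np.array([2.337e+06, 4.16779479e-01, 9.31e-04])]
stacked = np.array(data)   # shape (2, 3): one row per original array
col1 = stacked[:, 1]       # column 1 as a 1-d array
print(col1.mean())         # e.g. pass col1 to your method instead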

Why does a[:,[x]] create a column vector from an array?

Why does a[:,[x]] create a column vector from an array? What do the brackets [ ] represent?
Could anyone explain the principle to me?
import numpy as np
import torch

a = np.random.randn(5,6)
a = a.astype(np.float32)
print(a)
c = torch.from_numpy(a[:,[1]])
[[-1.6919796 0.3160475 0.7606999 0.16881375 1.325092 0.71536326]
[ 1.217861 0.35804042 0.0285245 0.7097111 -2.1760604 0.992101 ]
[-1.6351479 0.6607222 0.9375339 0.5308735 -1.9699149 -2.002803 ]
[-1.1895325 1.1744579 -0.5980689 -0.8906375 -0.00494479 0.51751447]
[-1.7642071 0.4681248 1.3938268 -0.7519176 0.5987852 -0.5138923 ]]
###########################################
tensor([[0.3160],
[0.3580],
[0.6607],
[1.1745],
[0.4681]])
The [ ] means you are adding an extra dimension. Check the numpy shape attribute to see the difference (for the 5x6 array a defined above):
a[:,1].shape
output:
(5,)
with [ ]
a[:,[1]].shape
output:
(5, 1)
That syntax is for array slicing in numpy, where arrays are indexed as a[rows, columns, page, ... (higher-dimensions)]
Selecting for a specific row/column/page is done by giving a specific number or range of numbers. So when you use a[1,2], numpy gets the element from row 1, column 2.
You can select several specific indices by giving a dimension multiple values. So a[[1,3],1] gets you both elements (1,1) and (3,1).
The : tells numpy to get everything from that specific array dimension. So when you use a[:,1], numpy gets every row in column 1. Alternatively, a[1,:] gets every column in row 1.
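Putting the two indexing styles side by side, a minimal sketch with a fresh 5x6 array:
import numpy as np
a = np.random.randn(5, 6)
print(a[:, 1].shape)     # (5,)  - an integer index drops the dimension
print(a[:, [1]].shape)   # (5, 1) - a list index keeps it: a column vector
print(a[1, :].shape)     # (6,)  - every column in row 1
print(a[[1, 3], 1])      # elements (1,1) and (3,1)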

Remove every second row of array multiple times

So I have several .txt files with over 80,000 rows of data each.
That might not be much for Python; however, I need to use this data in R, where I need a certain package, and there it takes around 30 seconds to load one file - and I have 1200 of these files.
However, the data in these files are rather dense. There is no need to have such small steps, i.e. I want to remove some in order to make the file smaller.
What I'm using now is as follows:
np.delete(np.array(data_lines), np.arange(1, np.array(data_lines).size, 2))
I make it start at row index 1 and then remove every second row of the data_lines array containing the 80,000+ lines of data. However, as you can see, this only reduces the rows by 1/2, and I probably need at least a 1/10 reduction. In principle I could do some kind of loop, but I was wondering if there is a smarter way to achieve it?
a = np.array(data_lines)[::10]
Takes every tenth row of data. No data is copied; the slicing returns a view.
You should use slicing. In my example array, the values in each row are identical to the row index (0, 1, ..., 79999). I keep every tenth row of my 80000 x 1 np array (the number of columns doesn't matter; this would work on an array with more than one column). If you want to slice it differently, here's more info on slicing: https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.indexing.html
import numpy as np
data_lines = np.arange(0, 80000).reshape((80000, 1))
data_lines_subset = data_lines[::10]
# data_lines_subset
# array([[    0],
#        [   10],
#        [   20],
#        ...,
#        [79970],
#        [79980],
#        [79990]])
So in your case, if your data_lines array isn't already a np array:
data_lines_subset = np.array(data_lines)[::10]

How do you edit cells in a sparse matrix using scipy?

I'm trying to manipulate some data in a sparse matrix. Once I've created one, how do I add / alter / update values in it? This seems very basic, but I can't find it in the documentation for the sparse matrix classes, or on the web. I think I'm missing something crucial.
This is my failed attempt to do it the same way I would with a normal array.
>>> from scipy.sparse import bsr_matrix
>>> A = bsr_matrix((10,10))
>>> A[5][7] = 6
Traceback (most recent call last):
File "<pyshell#11>", line 1, in <module>
A[5][7] = 6
File "C:\Python27\lib\site-packages\scipy\sparse\bsr.py", line 296, in __getitem__
raise NotImplementedError
NotImplementedError
There are several sparse matrix formats. Some are better suited to indexing than others. One that implements it is lil_matrix.
Al = A.tolil()
Al[5,7] = 6       # the normal 2d matrix indexing notation
print(Al)
print(Al.A)       # aka Al.todense()
A1 = Al.tobsr()   # if it must be in bsr format
The documentation for each format suggests what it is good at, and where it is bad. But it does not have a neat list of which ones have which operations defined.
Advantages of the LIL format
supports flexible slicing
changes to the matrix sparsity structure are efficient
...
Intended Usage
LIL is a convenient format for constructing sparse matrices
...
dok_matrix also implements indexing.
The underlying data structure for coo_matrix is easy to understand: it is essentially the parameters of the coo_matrix((data, (i, j)), [shape=(M, N)]) constructor. To create the same matrix you could use:
sparse.coo_matrix(([6],([5],[7])), shape=(10,10))
If you have more assignments, build larger data, i, j lists (or 1d arrays), and when complete construct the sparse matrix.
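A minimal sketch of that incremental pattern (the entries here are hypothetical):
import numpy as np
from scipy import sparse
data, i, j = [], [], []
for value, row, col in [(6, 5, 7), (10, 2, 4)]:   # hypothetical assignments
    data.append(value)
    i.append(row)
    j.append(col)
A = sparse.coo_matrix((data, (i, j)), shape=(10, 10))   # build once, at the end
print(A.toarray())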
The documentation for bsr is in the bsr_matrix docs and for csr in the csr_matrix docs. It might be worth understanding csr before moving to bsr. The only difference is that bsr has entries that are matrices themselves, whereas the basic unit in csr is a scalar.
I don't know if there are super easy ways to manipulate the matrices once they are created, but here are some examples of what you're trying to do,
import numpy as np
from scipy.sparse import bsr_matrix, csr_matrix
row = np.array([5])
col = np.array([7])
data = np.array([6])
A = csr_matrix((data, (row, col)))
This is a straightforward syntax in which you list all the data you want in the matrix in the array data and then specify where that data should go using row and col. Note that this will make the matrix dimensions just big enough to hold the element in the largest row and column (in this case a 6x8 matrix). You can see the matrix in standard form using the todense() method.
A.todense()
However, you cannot manipulate the matrix on the fly using this pattern. What you can do is modify the native scipy representation of the matrix. This involves 3 attributes, indices, indptr, and data. To start with, we can examine the value of these attributes for the array we've already created.
>>> print A.data
array([6])
>>> print A.indices
array([7], dtype=int32)
>>> print A.indptr
array([0, 0, 0, 0, 0, 0, 1], dtype=int32)
data is the same thing it was before, a 1-d array of the values we want in the matrix. The difference is that the position of this data is now specified by indices and indptr instead of row and col. indices is fairly straightforward: it is simply a list of which column each data entry is in, and it will always be the same size as the data array. indptr is a little trickier. It lets the data structure know what row each data entry is in. To quote from the docs,
the column indices for row i are stored in indices[indptr[i]:indptr[i+1]]
From this definition we can see that the size of indptr will always be the number of rows in the matrix + 1. It takes a little while to get used to, but working through the values for each row will give you some intuition. Note that all the entries are zero until the last one. That means the column indices for rows i=0-4 are stored in indices[0:0], i.e. the empty array, because these rows are all zeros. Finally, on the last row, i=5, we get indices[0:1] = [7], which tells us the data entries data[0:1] are in row 5, column 7.
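A short sketch of that rule, walking the rows of the A built above via indptr:
# columns for row i live in indices[indptr[i]:indptr[i+1]]
for i in range(A.shape[0]):
    cols = A.indices[A.indptr[i]:A.indptr[i+1]]
    vals = A.data[A.indptr[i]:A.indptr[i+1]]
    print(i, cols, vals)   # only row 5 yields a non-empty entry: column 7, value 6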
Now suppose we wanted to add the value 10 at row 2 column 4. We first put it into the data attribute,
A.data = np.array([10, 6])
next we update indices to indicate the column 10 will be in,
A.indices = np.array([4, 7], dtype=np.int32)
and finally we indicate which row it will be in by modifying indptr
A.indptr = np.array([0, 0, 0, 1, 1, 1, 2], dtype=np.int32)
It is important that you make the data type of indices and indptr np.int32. One way to visualize what's going on in indptr is that the change in numbers occurs as you move from i to i+1 for a row that has data. Also note that arrays like these can be used to construct sparse matrices:
B = csr_matrix((data, indices, indptr))
It would be nice if it was as easy as simply indexing into the array as you tried, but the implementation is not there yet. That should be enough to get you started at least.
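For reference, the attribute edits above collected into one runnable sketch, with a check of the result:
import numpy as np
from scipy.sparse import csr_matrix
A = csr_matrix((np.array([6]), (np.array([5]), np.array([7]))))
# insert the value 10 at row 2, column 4 by rewriting the three attributes
A.data = np.array([10, 6])
A.indices = np.array([4, 7], dtype=np.int32)
A.indptr = np.array([0, 0, 0, 1, 1, 1, 2], dtype=np.int32)
print(A.todense())   # 6x8 matrix with 10 at (2,4) and 6 at (5,7)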

calculating means of many matrices in numpy

I have many csv files which each contain roughly identical matrices. Each matrix is 11 columns by either 5 or 6 rows. The columns are variables and the rows are test conditions. Some of the matrices do not contain data for the last test condition, which is why there are 5 rows in some matrices and six rows in other matrices.
My application is in Python 2.6, using numpy and scipy.
My question is this:
How can I most efficiently create a summary matrix that contains the means of each cell across all of the identical matrices?
The summary matrix would have the same structure as all of the other matrices, except that the value in each cell in the summary matrix would be the mean of the values stored in the identical cell across all of the other matrices. If one matrix does not contain data for the last test condition, I want to make sure that its contents are not treated as zeros when the averaging is done. In other words, I want the means of all the non-zero values.
Can anyone show me a brief, flexible way of organizing this code so that it does everything I want to do with as little code as possible and also remain as flexible as possible in case I want to re-use this later with other data structures?
I know how to pull all the csv files in and how to write output. I just don't know the most efficient way to structure flow of data in the script, including whether to use python arrays or numpy arrays, and how to structure the operations, etc.
I have tried coding this in a number of different ways, but they all seem to be rather code intensive and inflexible if I later want to use this code for other data structures.
You could use masked arrays. Say N is the number of csv files. You can store all your data in a masked array A, of shape (N,11,6).
import numpy as np
A = np.ma.zeros((N, 11, 6))
A.mask = np.zeros_like(A)   # fill the mask with zeros: nothing is masked
A.mask = (A.data == 0)      # another way of masking: mask all data equal to zero
A.mask[0,0,0] = True        # mask a single value
A[1,2,3] = 12.              # fill a value, as with a regular array
Then the mean values along the first axis, taking masked values into account, are given by:
np.mean(A, axis=0)   # the returned shape is (11, 6)
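A fuller sketch under stated assumptions (the file names, the load_padded helper, and the NaN padding are hypothetical; only the masked-array averaging comes from the answer above):
import numpy as np

def load_padded(filename):
    # read one csv into an array of shape (5, 11) or (6, 11)
    m = np.genfromtxt(filename, delimiter=',')
    if m.shape[0] == 5:                       # pad a missing last test condition
        m = np.vstack([m, np.full((1, 11), np.nan)])
    return m

filenames = ['run1.csv', 'run2.csv']          # hypothetical file list
stack = np.array([load_padded(f) for f in filenames])    # shape (N, 6, 11)
masked = np.ma.masked_invalid(stack)          # padded cells are ignored
summary = masked.mean(axis=0)                 # cell-wise means, shape (6, 11)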
