Convert row-column-value data to an array - numpy - python

I have data in long format that stores the row#, column# and value as shown below:
ROW COLUMN VALUE
1 1 1
1 3 3
2 1 1
2 2 2
3 1 1
3 2 2
3 3 3
Please note that certain ROW, COLUMN combinations are missing (for instance, there is no value for ROW = 1 and COLUMN = 2). I would like to convert this into a 3 x 3 array like the one below, with the missing row/column combinations filled in by 0:
1 0 3
1 2 0
1 2 3
My initial approach to this problem was to declare an empty 3 x 3 array, read in the three columns as 1d arrays, and loop over rows and columns, updating the array based on the value array. For small-dimensional cases this seems doable, but for higher dimensions it does not seem to be the "Pythonic" way to do it. Has this problem been tackled by some canned function in the numpy package? I looked into reshape - but that assumes no missing values.

Once you have the row, column and values in numpy arrays, you can do something like the following. (Note that I've taken the more Pythonic approach of putting the 0-based indices in row and col).
Here's the data, in one-dimensional arrays:
In [13]: row = np.array([0, 0, 1, 1, 2, 2, 2])
In [14]: col = np.array([0, 2, 0, 1, 0, 1, 2])
In [15]: values = np.array([11, 12, 13, 14, 15, 16, 17])
Create a two-dimensional array to hold the values. I use the maxima from row and col to figure out how big the array should be. You might use some other values if row and col don't necessarily include values in the last row or column.
In [16]: a = np.zeros((row.max()+1, col.max()+1), dtype=values.dtype)
Now fill in the values with this assignment:
In [17]: a[row, col] = values
Et voilà:
In [18]: a
Out[18]:
array([[11,  0, 12],
       [13, 14,  0],
       [15, 16, 17]])
Your example is a 3x3 array, but if you will actually have much larger arrays and not a lot of entries, you might consider using a scipy sparse matrix. For example, here's how you can create a "COO" matrix from the same data as above, using the coo_matrix class:
In [25]: from scipy.sparse import coo_matrix
In [26]: c = coo_matrix((values, (row, col)), shape=(row.max()+1, col.max()+1))
In [27]: c
Out[27]:
<3x3 sparse matrix of type '<type 'numpy.int64'>'
with 7 stored elements in COOrdinate format>
In [28]: c.A
Out[28]:
array([[11,  0, 12],
       [13, 14,  0],
       [15, 16, 17]])
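For completeness, tying this back to the question's original 1-based ROW/COLUMN/VALUE data, the whole recipe is just the assignment above plus an index shift (a minimal sketch):

```python
import numpy as np

# the long-format data from the question (1-based indices)
row = np.array([1, 1, 2, 2, 3, 3, 3])
col = np.array([1, 3, 1, 2, 1, 2, 3])
values = np.array([1, 3, 1, 2, 1, 2, 3])

# shift to 0-based indices and scatter into a zero-filled array
a = np.zeros((row.max(), col.max()), dtype=values.dtype)
a[row - 1, col - 1] = values
print(a)
# [[1 0 3]
#  [1 2 0]
#  [1 2 3]]
```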

Related

Get values of pandas series from a array of index locations

I have a 2-d array of indices into a pandas Series. I would like to create a 2-d array of the values from the Series that correspond to those indices.
For example:
import pandas as pd
import numpy as np
A = pd.Series(data=[1,2,3,4,5])
idx = np.array([[0,2,3],[2,3,1]])
Would like to return:
B = np.array([[1,3,4],[3,4,2]])
I know I could do this as a loop:
B = np.zeros((2,3))
for i in [0, 1]:
    B[i, :] = A[idx[i]]
However, in practice need to do this repeatedly so would like to broadcast the index locations directly. Pandas is not necessary, happy to do it all in numpy if easier.
Something like this might work:
A[idx.flatten()].values.reshape(idx.shape)
A[idx] gives a "Cannot index with multidimensional key" error.
In [190]: A = pd.Series(data=[1,2,3,4,5])
...: idx = np.array([[0,2,3],[2,3,1]])
But the 1d array derived from the Series can be indexed this way:
In [191]: A.values
Out[191]: array([1, 2, 3, 4, 5])
In [192]: A.values[idx]
Out[192]:
array([[1, 3, 4],
       [3, 4, 2]])
numpy has no problems returning an array with a dimension that matches idx.
Indexing the Series like this returns a Series - which by definition is 1d:
In [194]: A[idx.ravel()]
Out[194]:
0 1
2 3
3 4
2 3
3 4
1 2
dtype: int64
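Putting these pieces together, both forms recover the desired 2-d array; a quick self-contained sketch:

```python
import numpy as np
import pandas as pd

A = pd.Series(data=[1, 2, 3, 4, 5])
idx = np.array([[0, 2, 3], [2, 3, 1]])

# index the underlying ndarray directly: the result keeps idx's 2-d shape
B1 = A.values[idx]

# or flatten the key, index the Series, then reshape the values back
B2 = A[idx.ravel()].values.reshape(idx.shape)

print(B1)
# [[1 3 4]
#  [3 4 2]]
```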

Slicing a 2D NumPy Array by all zero rows

This is essentially the 2D array equivalent of slicing a python list into smaller lists at indexes that store a particular value. I'm running a program that extracts a large amount of data out of a CSV file and copies it into a 2D NumPy array. The basic format of these arrays are something like this:
[[ 0  8  9 10]
 [ 9  9  1  4]
 [ 0  0  0  0]
 [ 1  2  1  4]
 [ 0  0  0  0]
 [ 1  1  1  2]
 [39 23 10  1]]
I want to separate my NumPy array along rows that contain all zero values to create a set of smaller 2D arrays. The successful result for the above starting array would be the arrays:
[[ 0  8  9 10]
 [ 9  9  1  4]]
[[1 2 1 4]]
[[ 1  1  1  2]
 [39 23 10  1]]
I've thought about simply iterating down the array and checking if the row has all zeros but the data I'm handling is substantially large. I have potentially millions of rows of data in the text file and I'm trying to find the most efficient approach as opposed to a loop that could waste computation time. What are your thoughts on what I should do? Is there a better way?
a is your array. You can use any to find the rows that are not all zero, drop the all-zero rows, and then use split at their (shifted) indices:
# indices of rows that are not all zero
idx = np.flatnonzero(a.any(1))
# indices of all-zero rows
idx_zero = np.delete(np.arange(a.shape[0]), idx)
# select the non-all-zero rows and split at the positions the zero rows occupied
output = np.split(a[idx], idx_zero - np.arange(idx_zero.size))
output:
[array([[ 0,  8,  9, 10],
        [ 9,  9,  1,  4]]),
 array([[1, 2, 1, 4]]),
 array([[ 1,  1,  1,  2],
        [39, 23, 10,  1]])]
You can use the np.all function to check for rows which are all zeros, and then index appropriately.
# assume `x` is your data
indices = np.all(x == 0, axis=1)
zeros = x[indices]
nonzeros = x[np.logical_not(indices)]
The all function accepts an axis argument (as do many NumPy functions), which indicates the axis along which to operate. axis=1 here means the reduction is performed across each row, so you get back a boolean array of shape (x.shape[0],), which can be used to directly index x.
Note that this will be much faster than a for-loop over the rows, especially for large arrays.
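As a self-contained check, here is the split approach run end to end on the sample data from the question (the variable names here are my own):

```python
import numpy as np

x = np.array([[ 0,  8,  9, 10],
              [ 9,  9,  1,  4],
              [ 0,  0,  0,  0],
              [ 1,  2,  1,  4],
              [ 0,  0,  0,  0],
              [ 1,  1,  1,  2],
              [39, 23, 10,  1]])

# boolean mask of rows containing at least one nonzero value
nonzero_mask = x.any(axis=1)
idx = np.flatnonzero(nonzero_mask)        # indices of kept rows
idx_zero = np.flatnonzero(~nonzero_mask)  # indices of all-zero rows

# drop the zero rows, then split at the positions the zero rows occupied
pieces = np.split(x[idx], idx_zero - np.arange(idx_zero.size))
for p in pieces:
    print(p)
```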

Count how many columns of a numpy matrix contain all positive values

I want to check how many columns of a numpy array/matrix have only positive values.
I took my matrix, printed A > 0, and got True and False values; then I tried the any and all functions, but didn't succeed.
In [55]: a = np.array([[13, 21, 12],
    ...:               [21, -1, 6],
    ...:               [ 1, 10, 2],
    ...:               [41, 1, 4]])
The output should be 2.
I saved the matrix A in B and tried writing:
B.all(axis=1).any()>0
This function counts the number of column whose elements are all greater than 0:
def count(mat):
    counter = 0
    tmp = mat > 0
    for col in tmp.T:
        if all(col):
            counter += 1
    return counter
How does this function work?
First it assigns to tmp a boolean matrix indicating whether each value of the original matrix is greater than 0; then it iterates through the transpose of that matrix and checks whether all the values in each row are True, meaning they are all greater than 0.
The transpose contains the columns of the original matrix as its rows. When you create a numpy array you pass in the rows, so by transposing, iterating over the array yields the columns.
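The loop above can also be collapsed into one vectorized expression, following the same A > 0 idea the question started from:

```python
import numpy as np

a = np.array([[13, 21, 12],
              [21, -1, 6],
              [1, 10, 2],
              [41, 1, 4]])

# (a > 0) is a boolean matrix; all(axis=0) reduces each column to a
# single True/False, and sum() counts the all-positive columns.
n_positive_cols = (a > 0).all(axis=0).sum()
print(n_positive_cols)  # 2
```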

Subtracting minimum of row from the row

I know that
a - a.min(axis=0)
will subtract the minimum of each column from every element in the column. I want to subtract the minimum in each row from every element in the row. I know that
a.min(axis=1)
specifies the minimum within a row, but how do I tell the subtraction to go by rows instead of columns? (How do I specify the axis of the subtraction?)
edit: For my question, a is a 2d array in NumPy.
Assuming a is a numpy array, you can use this:
new_a = a - np.min(a, axis=1)[:,None]
Try it out:
import numpy as np
a = np.arange(24).reshape((4,6))
print(a)
new_a = a - np.min(a, axis=1)[:,None]
print(new_a)
Result:
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]]
[[0 1 2 3 4 5]
 [0 1 2 3 4 5]
 [0 1 2 3 4 5]
 [0 1 2 3 4 5]]
Note that np.min(a, axis=1) returns a 1d array of row-wise minimum values.
We then add an extra dimension to it using [:,None]. It then looks like this 2d array:
array([[ 0],
       [ 6],
       [12],
       [18]])
When this 2d array participates in the subtraction, it gets broadcasted into a shape of (4,6), which looks like this:
array([[ 0,  0,  0,  0,  0,  0],
       [ 6,  6,  6,  6,  6,  6],
       [12, 12, 12, 12, 12, 12],
       [18, 18, 18, 18, 18, 18]])
Now, element-wise subtraction happens between the two (4,6) arrays.
Specify keepdims=True to preserve a length-1 dimension in place of the dimension that min collapses, allowing broadcasting to work out naturally:
a - a.min(axis=1, keepdims=True)
This is especially convenient when the axis is determined at runtime, but it is arguably still clearer than manually reintroducing the collapsed dimension even when the axis is fixed.
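A quick sketch confirming that the keepdims form matches the [:,None] version on the same data:

```python
import numpy as np

a = np.arange(24).reshape((4, 6))

# keepdims=True leaves the reduced axis in place with length 1, so the
# result has shape (4, 1) and broadcasts against a's shape (4, 6)
new_a = a - a.min(axis=1, keepdims=True)

assert new_a.shape == (4, 6)
assert (new_a == a - np.min(a, axis=1)[:, None]).all()
```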
If you want to use only pandas, you can apply a lambda to every row that subtracts min(row):
new_df = df.apply(lambda row: row - row.min(), axis=1)

Find first nonzero column in scipy.sparse matrix

I am looking for the first column containing a nonzero element in a sparse matrix (scipy.sparse.csc_matrix). Actually, the first column starting with the i-th one to contain a nonzero element.
This is part of a certain type of linear equation solver. For dense matrices I had the following: (Relevant line is pcol = ...)
import numpy
D = numpy.matrix([[1,0,0],[2,0,0],[3,0,1]])
i = 1
pcol = i + numpy.argmax(numpy.any(D[:,i:], axis=0))
if pcol != i:
    # Pivot columns i, pcol
    D[:,[i,pcol]] = D[:,[pcol,i]]
print(D)
# Result should be numpy.matrix([[1,0,0],[2,0,0],[3,1,0]])
The above should swap columns 1 and 2. If we set i = 0 instead, D is unchanged since column 0 already contains nonzero entries.
What is an efficient way to do this for scipy.sparse matrices? Are there analogues for the numpy.any() and numpy.argmax() functions?
With a csc matrix it is easy to find the nonzero columns.
In [302]: arr=sparse.csc_matrix([[0,0,1,2],[0,0,0,2]])
In [303]: arr.A
Out[303]:
array([[0, 0, 1, 2],
       [0, 0, 0, 2]])
In [304]: arr.indptr
Out[304]: array([0, 0, 0, 1, 3])
In [305]: np.diff(arr.indptr)
Out[305]: array([0, 0, 1, 2])
The last line shows how many nonzero terms there are in each column.
np.nonzero(np.diff(arr.indptr))[0][0] would be the index of the first nonzero value in that diff.
Do the same on a csr matrix to find the 1st nonzero row.
I can elaborate on indptr if you want.
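Putting the indptr hint together with the pivot search from the question, a minimal sketch (the helper name first_nonzero_col is my own, not a scipy function):

```python
import numpy as np
from scipy import sparse

def first_nonzero_col(mat, i):
    """First column index >= i with a nonzero entry in a CSC matrix,
    or None if every remaining column is empty."""
    counts = np.diff(mat.indptr)       # nonzeros stored per column
    hits = np.flatnonzero(counts[i:])
    return i + hits[0] if hits.size else None

D = sparse.csc_matrix([[1, 0, 0], [2, 0, 0], [3, 0, 1]])
print(first_nonzero_col(D, 1))  # 2: column 1 is empty, column 2 is not
```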
