Quickly fill large Numpy matrix from Pandas DataFrame

I have a DataFrame df holding x coordinates, y coordinates, and values used to fill a NumPy matrix mat.
Example of a smaller df:
y x x x x value value value value
1 6 3 6 4 100 10 300 15
1 6 2 8 7 50 200 35 70
5 7 5 4 6 2 50 40 400
7 5 3 2 1 105 80 35 44
I want to fill mat = np.zeros(shape=(10,10)) so that each y is the column index and each x is the row index, using the value at the same position as x in the value block. For example:
col=1, row=6, value=100 ###
col=1, row=3, value=10
col=1, row=6, value=300 ###
col=1, row=4, value=15
col=1, row=6, value=50 ###
If more than one value lands on the same position (the lines marked ###), take the average. Is there a way to go directly from Pandas to the matrix (or another quick way)?
What I can do now is call np.ravel on selected columns of the DataFrame to make 1-D arrays and fill the matrix from those, but that is slow and very redundant.

Construct row and column indices and assign with integer array indexing. When an (i, j) pair repeats, the last assigned value wins.
val = df.values
j = val[:, 0].repeat(4)
i = val[:, 1: 5].ravel()
v = val[:, 5:].ravel()
mat = np.zeros(shape=(10,10), dtype=int)
mat[i, j] = v
mat
array([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 44, 0, 0],
[ 0, 200, 0, 0, 0, 0, 0, 35, 0, 0],
[ 0, 10, 0, 0, 0, 0, 0, 80, 0, 0],
[ 0, 15, 0, 0, 0, 40, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 50, 0, 105, 0, 0],
[ 0, 50, 0, 0, 0, 400, 0, 0, 0, 0],
[ 0, 70, 0, 0, 0, 2, 0, 0, 0, 0],
[ 0, 35, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
For averages, accumulate sums and counts with np.bincount on the flattened index i * 10 + j, then divide:
val = df.values
j = val[:, 0].repeat(4)
i = val[:, 1: 5].ravel()
v = val[:, 5:].ravel()
sums = np.bincount(i * 10 + j, v, 100)
cnts = np.bincount(i * 10 + j, minlength=100)
mask = cnts > 0
sums[mask] /= cnts[mask]
print(sums.reshape(10, 10))
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 44. 0. 0.]
[ 0. 200. 0. 0. 0. 0. 0. 35. 0. 0.]
[ 0. 10. 0. 0. 0. 0. 0. 80. 0. 0.]
[ 0. 15. 0. 0. 0. 40. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 50. 0. 105. 0. 0.]
[ 0. 150. 0. 0. 0. 400. 0. 0. 0. 0.]
[ 0. 70. 0. 0. 0. 2. 0. 0. 0. 0.]
[ 0. 35. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
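The same averaging can also be sketched with np.add.at, which accumulates directly into 2-D arrays and avoids the manual i * 10 + j flattening. This is only an alternative illustration, not the answer's method: the array val below is a stand-in for df.values built from the example data, and the 10 x 10 shape is taken from the question.

```python
import numpy as np

# Stand-in for df.values: column 0 is y, columns 1-4 are x, columns 5-8 are values.
val = np.array([
    [1, 6, 3, 6, 4, 100, 10, 300, 15],
    [1, 6, 2, 8, 7, 50, 200, 35, 70],
    [5, 7, 5, 4, 6, 2, 50, 40, 400],
    [7, 5, 3, 2, 1, 105, 80, 35, 44],
])

j = val[:, 0].repeat(4)    # each y repeated once per x column
i = val[:, 1:5].ravel()    # x entries, row by row
v = val[:, 5:].ravel()     # values in the same order as i

sums = np.zeros((10, 10))
cnts = np.zeros((10, 10))
np.add.at(sums, (i, j), v)  # unbuffered add: repeated (i, j) pairs accumulate
np.add.at(cnts, (i, j), 1)

# Divide only where at least one value landed; other cells stay 0.
avg = np.divide(sums, cnts, out=np.zeros_like(sums), where=cnts > 0)
print(avg[6, 1])  # mean of 100, 300 and 50
```

For very large inputs np.bincount is often faster than np.add.at, so the bincount version above may still be the better choice when speed matters.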

Related

Where is this \n coming from in my arrays (Python)?

I'm trying to create a text string for the following numpy array:
A = array([0, 0, 0, 0, 0.64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])
I can do this easily with this line of code:
text = f'{A}'
The problem I'm having is that whenever I use this f'{}' to create a string from an array, it outputs the same array, but with a \n after some characters:
text
'[0. 0. 0. 0.64 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n 1. 0. 0. 0. ]'
I'm trying to use this array in the title of a plot, so I don't want the array to be text wrapping onto a new line because it makes it confusing to read/see.
I've tried using rstrip('\n') on text but it doesn't remove the '\n'. Does anyone have any idea what's going on? Why is this \n popping up in the string array?

You don't need to declare array() to accomplish what you are trying to do:
A = [0, 0, 0, 0, 0.64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(A)
[0, 0, 0, 0, 0.64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(f'{A}')
[0, 0, 0, 0, 0.64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
A
[0, 0, 0, 0, 0.64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
It seems like you are creating a NumPy N-dimensional array there and then converting it to a string, so it is printing the string representation of that array when you call print(). Unless you specifically need a NumPy array, you can do it just like I have above, or if you need to:
from numpy import array
A = [0, 0, 0, 0, 0.64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
B = array(A)
print(B)
[0. 0. 0. 0. 0.64 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 1. 0. 0. 0. ]
print(f'{B}')
[0. 0. 0. 0. 0.64 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 1. 0. 0. 0. ]
B
array([0. , 0. , 0. , 0. , 0.64, 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. ])
If you absolutely have to render a NumPy array as a string, then you can do something like this:
text = f'{B}'
text = text.replace("\n","")
Or as Ramón Márquez also mentioned, you can simply increase the printoptions line width:
numpy.set_printoptions(linewidth=96)
Documentation on NumPy arrays: https://machinelearningmastery.com/gentle-introduction-n-dimensional-arrays-python-numpy/
Documentation on NumPy print options: https://numpy.org/doc/1.18/reference/generated/numpy.printoptions.html

It has to do with the way numpy can be configured to print out arrays.
If you set linewidth to 96 (the length of str(A) without any \n, plus 1), it won't insert line breaks:
numpy.set_printoptions(linewidth=96)
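If changing the global print options is undesirable, the line width can also be set for a single call or a single block. The sketch below assumes a width of 1000 characters is enough for the title string; that number is arbitrary and not from the answers above.

```python
import numpy as np

A = np.array([0, 0, 0, 0, 0.64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])

# Option 1: format one array with its own line width.
text = np.array2string(A, max_line_width=1000)

# Option 2: temporarily widen the print options, then restore them on exit.
with np.printoptions(linewidth=1000):
    text_ctx = f'{A}'

print(text)
```

Both variants leave the global settings untouched, so other printed arrays keep wrapping as before.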

how to generate a modified version of identity matrix in python

I want to generate a modified version of the identity matrix, call it C, such that Cii is zero up to some index i and the rest of the diagonal is still 1.
I can set each Cii to 0 by brute force, but that does not seem like a good approach.
Is there an efficient function I can use? This is hard to search for.
Example below:
The original 3 x 3 identity matrix is
1 0 0
0 1 0
0 0 1
I want to change this into:
0 0 0
0 1 0
0 0 1
Here i is 0; in general I want to set Ckk to 0 for every k in [0, i].

np.diag makes a 2d array from a 1d diagonal:
In [97]: np.diag((np.arange(6)>2).astype(int))
Out[97]:
array([[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 1]])
This is basically the same as Paul Panzer's approach, but it generates the diagonal a different way, with similar speed.

Here is one possibility:
N = 5
k = 2
np.diag(np.bincount([k],None,N).cumsum())
array([[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 1, 0],
[0, 0, 0, 0, 1]])
Update: fast solution:
out = np.zeros((N,N))
out.reshape(-1)[(N+1)*k::N+1] = 1

You can build an NxN identity matrix and assign zero to the top left KxK corner:
N,K = 10,3
im = np.identity(N)
im[:K,:K] = 0
print(im)
output:
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
This is 40% faster than hpaulj's but not as fast as Paul Panzer's fast solution (which is 3x faster than this).
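For comparison, here is a small sketch that writes the ones directly onto the diagonal with integer indexing. The helper name partial_identity is hypothetical, and k here means the number of leading diagonal entries to zero (so k = i + 1 in the question's notation).

```python
import numpy as np

def partial_identity(n, k, dtype=float):
    """Return an n x n matrix with ones on the diagonal from position k onward."""
    out = np.zeros((n, n), dtype=dtype)
    idx = np.arange(k, n)   # diagonal positions that keep their 1
    out[idx, idx] = 1
    return out

print(partial_identity(5, 2))
```

It produces the same matrix as the answers above; no timing claim is made for it.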

Getting the singular values of np.linalg.svd as a matrix

Given a 5x4 matrix A, constructed with this piece of Python code:
A = np.array([[1, 0, 0, 0],
[0, 0, 0, 4],
[0, 3, 0, 0],
[0, 0, 0, 0],
[2, 0, 0, 0]])
WolframAlpha gives the SVD result with the singular values Σ laid out as a matrix (the images from the original post are not preserved).
The equivalent quantity in the output of np.linalg.svd (NumPy calls it s) is a 1-D array:
[ 4. 3. 2.23606798 -0. ]
Is there a way to have this quantity from numpy.linalg.svd shown the way WolframAlpha shows it?

You can get most of the way there with diag:
>>> u, s, vh = np.linalg.svd(A)
>>> np.diag(s)
array([[ 4. , 0. , 0. , 0. ],
[ 0. , 3. , 0. , 0. ],
[ 0. , 0. , 2.23606798, 0. ],
[ 0. , 0. , 0. , -0. ]])
Note that wolfram alpha is giving an extra row. Getting that is marginally more involved:
>>> sigma = np.zeros(A.shape, s.dtype)
>>> np.fill_diagonal(sigma, s)
>>> sigma
array([[ 4. , 0. , 0. , 0. ],
[ 0. , 3. , 0. , 0. ],
[ 0. , 0. , 2.23606798, 0. ],
[ 0. , 0. , 0. , -0. ],
[ 0. , 0. , 0. , 0. ]])
Depending on what your goal is, removing a column from U might be a better approach than adding a row of zeros to sigma. That would look like:
>>> u, s, vh = np.linalg.svd(A, full_matrices=False)
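As a quick check of both variants, the sketch below rebuilds A from its factors. The variable names sigma, u_r, s_r and vh_r are only illustrative.

```python
import numpy as np

A = np.array([[1, 0, 0, 0],
              [0, 0, 0, 4],
              [0, 3, 0, 0],
              [0, 0, 0, 0],
              [2, 0, 0, 0]], dtype=float)

# Full SVD: u is 5x5, so sigma must be 5x4 (the shape of A).
u, s, vh = np.linalg.svd(A)
sigma = np.zeros(A.shape, dtype=s.dtype)
np.fill_diagonal(sigma, s)
full_ok = np.allclose(u @ sigma @ vh, A)

# Reduced SVD: u_r is 5x4, so a square 4x4 diag(s_r) is enough.
u_r, s_r, vh_r = np.linalg.svd(A, full_matrices=False)
reduced_ok = np.allclose(u_r @ np.diag(s_r) @ vh_r, A)

print(full_ok, reduced_ok)
```

Which form to use depends on whether the full orthogonal U is needed later; the reduced form stores less.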

Add 2-d array to 3-d array with constantly changing index fast

I'm trying to add a 2-d array to a 3-d array at a constantly changing index. I came up with the following code:
import numpy as np
a = np.zeros([8, 3, 5])
k = 0
for i in range(2):
    for j in range(4):
        a[k, i: i + 2, j: j + 2] += np.ones([2, 2], dtype=int)
        k += 1
print(a)
which gives exactly what I want:
[[[1. 1. 0. 0. 0.]
[1. 1. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
[[0. 1. 1. 0. 0.]
[0. 1. 1. 0. 0.]
[0. 0. 0. 0. 0.]]
[[0. 0. 1. 1. 0.]
[0. 0. 1. 1. 0.]
[0. 0. 0. 0. 0.]]
[[0. 0. 0. 1. 1.]
[0. 0. 0. 1. 1.]
[0. 0. 0. 0. 0.]]
[[0. 0. 0. 0. 0.]
[1. 1. 0. 0. 0.]
[1. 1. 0. 0. 0.]]
[[0. 0. 0. 0. 0.]
[0. 1. 1. 0. 0.]
[0. 1. 1. 0. 0.]]
[[0. 0. 0. 0. 0.]
[0. 0. 1. 1. 0.]
[0. 0. 1. 1. 0.]]
[[0. 0. 0. 0. 0.]
[0. 0. 0. 1. 1.]
[0. 0. 0. 1. 1.]]]
I would like it to be faster, so I created an index array and tried np.vectorize. But as the manual describes, vectorize is not for performance. My goal is to run through an array of shape (10^6, 15, 15), which ends up with 10^6 iterations. I hope there is a cleaner solution that gets rid of all the for-loops.
This is the first time I am using Stack Overflow; any suggestions are appreciated.
Thank you.

An efficient solution uses numpy.lib.stride_tricks, which can "view" all the possible positions.
import numpy as np
from numpy.lib.stride_tricks import as_strided
N = 4  # tray size (square)
P = 3  # chunk size
R = N - P
tray = np.zeros((N, N), np.int32)
chunk = np.ones((P, P), np.int32)
tray[R:, R:] = chunk
tray = np.vstack((tray, tray))  # stack a second copy so shifted windows stay in bounds
# strides are in bytes; 4 is the itemsize of int32
view = as_strided(tray, shape=(R+1, R+1, N, N), strides=(4*N, 4, 4*N, 4))
a_view = view.reshape(-1, N, N)
a_hard = a_view.copy()
Here is the result :
In [3]: a_view
Out[3]:
array([[[0, 0, 0, 0],
[0, 1, 1, 1],
[0, 1, 1, 1],
[0, 1, 1, 1]],
[[0, 0, 0, 0],
[1, 1, 1, 0],
[1, 1, 1, 0],
[1, 1, 1, 0]],
[[0, 1, 1, 1],
[0, 1, 1, 1],
[0, 1, 1, 1],
[0, 0, 0, 0]],
[[1, 1, 1, 0],
[1, 1, 1, 0],
[1, 1, 1, 0],
[0, 0, 0, 0]]])
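view is just a view on the possible positions of a chunk on the tray: it costs no computation and uses only twice the tray space. Because the two leading strided axes cannot be merged without copying, the reshape into a_view already makes a copy.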
a_hard is a hard copy, necessary if you need to modify it.
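The loop in the question can also be replaced more literally with broadcast index arrays. The following is a sketch under the question's small dimensions (2 x 4 positions, 2 x 2 patch of ones, target shape (8, 3, 5)); the names n_i, n_j and size are illustrative and not from the original post.

```python
import numpy as np

n_i, n_j, size = 2, 4, 2            # loop bounds and patch size from the question
a = np.zeros((n_i * n_j, 3, 5))

k = np.arange(n_i * n_j)            # one slab per (i, j) pair
i, j = np.divmod(k, n_j)            # k = i * n_j + j, matching the nested loops
d = np.arange(size)                 # offsets inside the patch

rows = i[:, None, None] + d[None, :, None]   # shape (8, 2, 1)
cols = j[:, None, None] + d[None, None, :]   # shape (8, 1, 2)

# The three index arrays broadcast to (8, 2, 2); add 1 at every indexed cell.
np.add.at(a, (k[:, None, None], rows, cols), 1.0)
print(a[5])
```

np.add.at is used so that repeated indices would still accumulate correctly. In this particular layout no cell is hit twice, so a plain a[k[:, None, None], rows, cols] += 1 should give the same result and is usually faster.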

scipy sparse matrix division

I have been trying to divide a python scipy sparse matrix by a vector sum of its rows. Here is my code
sparse_mat = bsr_matrix((l_data, (l_row, l_col)), dtype=float)
sparse_mat = sparse_mat / (sparse_mat.sum(axis = 1)[:,None])
However, it throws an error no matter how I try it
sparse_mat = sparse_mat / (sparse_mat.sum(axis = 1)[:,None])
File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 381, in __div__
return self.__truediv__(other)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 427, in __truediv__
raise NotImplementedError
NotImplementedError
Anyone with an idea of where I am going wrong?

You can circumvent the problem by creating a sparse diagonal matrix from the reciprocals of your row sums and then multiplying it with your matrix. In the product the diagonal matrix goes left and your matrix goes right.
Example:
>>> a
array([[0, 9, 0, 0, 1, 0],
[2, 0, 5, 0, 0, 9],
[0, 2, 0, 0, 0, 0],
[2, 0, 0, 0, 0, 0],
[0, 9, 5, 3, 0, 7],
[1, 0, 0, 8, 9, 0]])
>>> b = sparse.bsr_matrix(a)
>>>
>>> c = sparse.diags(1/b.sum(axis=1).A.ravel())
>>> # on older scipy versions the offsets parameter (default 0)
... # is a required argument, thus
... # c = sparse.diags(1/b.sum(axis=1).A.ravel(), 0)
...
>>> a/a.sum(axis=1, keepdims=True)
array([[ 0. , 0.9 , 0. , 0. , 0.1 , 0. ],
[ 0.125 , 0. , 0.3125 , 0. , 0. , 0.5625 ],
[ 0. , 1. , 0. , 0. , 0. , 0. ],
[ 1. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0.375 , 0.20833333, 0.125 , 0. , 0.29166667],
[ 0.05555556, 0. , 0. , 0.44444444, 0.5 , 0. ]])
>>> (c @ b).todense() # on Python < 3.5 replace c @ b with c.dot(b)
matrix([[ 0. , 0.9 , 0. , 0. , 0.1 , 0. ],
[ 0.125 , 0. , 0.3125 , 0. , 0. , 0.5625 ],
[ 0. , 1. , 0. , 0. , 0. , 0. ],
[ 1. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0.375 , 0.20833333, 0.125 , 0. , 0.29166667],
[ 0.05555556, 0. , 0. , 0.44444444, 0.5 , 0. ]])

Something funny is going on. I have no problem performing the element division. I wonder if it's a Py2 issue. I'm using Py3.
In [1022]: A=sparse.bsr_matrix([[2,4],[1,2]])
In [1023]: A
Out[1023]:
<2x2 sparse matrix of type '<class 'numpy.int32'>'
with 4 stored elements (blocksize = 2x2) in Block Sparse Row format>
In [1024]: A.A
Out[1024]:
array([[2, 4],
[1, 2]], dtype=int32)
In [1025]: A.sum(axis=1)
Out[1025]:
matrix([[6],
[3]], dtype=int32)
In [1026]: A/A.sum(axis=1)
Out[1026]:
matrix([[ 0.33333333, 0.66666667],
[ 0.33333333, 0.66666667]])
or to try the other example:
In [1027]: b=sparse.bsr_matrix([[0, 9, 0, 0, 1, 0],
...: [2, 0, 5, 0, 0, 9],
...: [0, 2, 0, 0, 0, 0],
...: [2, 0, 0, 0, 0, 0],
...: [0, 9, 5, 3, 0, 7],
...: [1, 0, 0, 8, 9, 0]])
In [1028]: b
Out[1028]:
<6x6 sparse matrix of type '<class 'numpy.int32'>'
with 14 stored elements (blocksize = 1x1) in Block Sparse Row format>
In [1029]: b.sum(axis=1)
Out[1029]:
matrix([[10],
[16],
[ 2],
[ 2],
[24],
[18]], dtype=int32)
In [1030]: b/b.sum(axis=1)
Out[1030]:
matrix([[ 0. , 0.9 , 0. , 0. , 0.1 , 0. ],
[ 0.125 , 0. , 0.3125 , 0. , 0. , 0.5625 ],
....
[ 0.05555556, 0. , 0. , 0.44444444, 0.5 , 0. ]])
The result of this sparse/dense division is also dense, whereas c*b (where c is the sparse diagonal matrix) is sparse.
In [1039]: c*b
Out[1039]:
<6x6 sparse matrix of type '<class 'numpy.float64'>'
with 14 stored elements in Compressed Sparse Row format>
The sparse sum is a dense matrix. It is 2d, so there's no need to expand its dimensions. In fact, if I try that I get an error:
In [1031]: A/(A.sum(axis=1)[:,None])
....
ValueError: shape too large to be a matrix.

Per this message, to keep the matrix sparse, you access the data values and use the (nonzero) indices:
sums = np.asarray(A.sum(axis=1)).squeeze() # this is dense
A.data /= sums[A.nonzero()[0]]
If dividing by the nonzero row mean instead of the sum, one can
nnz = A.getnnz(axis=1) # this is also dense
means = sums / nnz
A.data /= means[A.nonzero()[0]]
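Putting the diagonal-matrix approach into one runnable piece, the sketch below row-normalizes a sparse matrix and keeps the result sparse. It assumes CSR format and adds a guard for all-zero rows, which the answers above do not discuss; the names row_sums, inv and normalized are illustrative.

```python
import numpy as np
from scipy import sparse

a = np.array([[0, 9, 0, 0, 1, 0],
              [2, 0, 5, 0, 0, 9],
              [0, 2, 0, 0, 0, 0],
              [2, 0, 0, 0, 0, 0],
              [0, 9, 5, 3, 0, 7],
              [1, 0, 0, 8, 9, 0]], dtype=float)
b = sparse.csr_matrix(a)

row_sums = np.asarray(b.sum(axis=1)).ravel()   # dense 1-D array of row sums

# Reciprocal of each row sum; rows that sum to zero are left as zero.
inv = np.zeros_like(row_sums)
nonzero = row_sums != 0
inv[nonzero] = 1.0 / row_sums[nonzero]

# Left-multiplying by a diagonal matrix scales each row of b.
normalized = sparse.diags(inv) @ b
print(normalized.toarray())
```

Each row of the result sums to 1 wherever the original row sum was nonzero.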
