initialize a matrix in python of the row number - python

Is there a way to initialize a 3 row, 5 column matrix which contains these values without using a for loop?
[[0 0 0 0 0
1 1 1 1 1
2 2 2 2 2]]

It's possible.
i = 0
matrix = []
while i <=2:
matrix += [[i]*5]
i += 1

Without any for loops or list comprehensions, you can use a combination of built-in functions:
map(list, zip(*[range(3)] * 5))

If you're dealing with large datasets and are worried about performance, you might want to consider putting your data into a two-dimensional NumPy array. Here are a couple of ways of generating the matrix you ask for in NumPy:
>>> import numpy as np
>>> np.indices((3, 5))[0]
array([[0, 0, 0, 0, 0],
[1, 1, 1, 1, 1],
[2, 2, 2, 2, 2]])
>>> np.repeat(np.arange(3), 5).reshape((3, 5))
array([[0, 0, 0, 0, 0],
[1, 1, 1, 1, 1],
[2, 2, 2, 2, 2]])
The first of these is simpler, but a little bit wasteful: the np.indices call actually generates the array you want (which could be regarded as an array of row indices) along with a companion array of column indices:
>>> np.indices((3, 5))[1]
array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]])
with both arrays packed conveniently into a single array of shape (2, 3, 5). If you need that second array anyway for what you're doing then np.indices is the way to go (though in that case you may also want to look into NumPy's mgrid, ogrid and meshgrid functions). The second solution with np.repeat only generates the values you need, and not surprisingly, finishes about twice as fast on my machine when I bump the size of the matrix up to (3000, 5000):
In [19]: %timeit np.indices((3000, 5000))[0]
10 loops, best of 3: 156 ms per loop
In [20]: %timeit np.repeat(np.arange(3000), 5000).reshape((3000, 5000))
10 loops, best of 3: 88.4 ms per loop
Having said that, using np.repeat in this way is a little bit of an antipattern in NumPy: it's often better to avoid the repetition by creating a 2d array with 3 rows and a single column, and rely on NumPy's broadcasting to interpret this correctly when it's combined with other arrays. If you go that way, all you need is:
>>> np.arange(3)[:, np.newaxis]
array([[0],
[1],
[2]])
This is an array of shape (3, 1); a subsequent operation with an array of shape (5,) or (1, 5) (for example) would be subject to NumPy's broadcasting rules, producing an output of shape (3, 5). For example, here's what happens when we add a 1d array of zeros to the above:
>>> np.arange(3)[:, np.newaxis] + np.zeros(5, dtype=int)
array([[0, 0, 0, 0, 0],
[1, 1, 1, 1, 1],
[2, 2, 2, 2, 2]])
And for completeness, here's one more variation, using np.tile:
>>> np.tile(np.arange(3)[:, np.newaxis], (1, 5))
array([[0, 0, 0, 0, 0],
[1, 1, 1, 1, 1],
[2, 2, 2, 2, 2]])
All of these solutions should have reasonably similar performance for large values of 3 and 5; if this is a bottleneck, you'll want to do timings on your machine to decide which to use. On my machine, the +np.zeros broadcasting solution beats the others by some margin.

This is one of easy way to understand for Python Beginner.
matrix = []
for data in range(3):
matrix.append([data] * 5)

This is possible using:
[[data] * 5 for data in range(3)]

Related

Select random non zero elements from each row of a 2d numpy array

I have a 2d array
a = array([[5, 0, 1, 0],
[0, 1, 3, 5],
[2, 3, 0, 0],
[4, 0, 2, 4],
[3, 2, 0, 3]])
and a 1d array
b = array([1, 2, 1, 2, 2])
which (b) tells how many non-zero elements we want to choose from each row of the array a.
For example, b[0] = 1 tells us that we have to choose 1 non-zero element from a[0], b[1] = 2 tells us that we have to choose 2 non-zero elements from a[1], and so on.
For a 1d array, it can be done using np.random.choice, but I can't find how to do it for a 2d array, so I have to use a for loop which slows the computation.
I want the result as 2d array as
array([[5, 0, 0, 0],
[0, 1, 0, 5],
[2, 0, 0, 0],
[0, 0, 2, 4],
[3, 2, 0, 0]])
Here, we have 1 element in row 1, 2 elements in row 2, 1 element in row 3 and so on as given in array b.
It looks like a Competitive Programming problem.
I don't think that you can achieve the results using numpy.random.choice (I may be wrong).
Anyways, think of it like this. To select x number of non-zero elements from a 1D array of size n, it will be of O(n) complexity in the worst case. And, for a 2D array it will be O(n^2) if you follow the same naive approach.
this post is almost similar to your question, but numpy.nonzero also is an O(n^2) function.

vectorizing numpy bincount

I have a 2d numpy array., A I want to apply np.bincount() to each column of the matrix A to generate another 2d array B that is composed of the bincounts of each column of the original matrix A.
My problem is that np.bincount() is a function that takes a 1d array-like. It's not an array method like B = A.max(axis=1) for example.
Is there a more pythonic/numpythic way to generate this B array other than a nasty for-loop?
import numpy as np
states = 4
rows = 8
cols = 4
A = np.random.randint(0,states,(rows,cols))
B = np.zeros((states,cols))
for x in range(A.shape[1]):
B[:,x] = np.bincount(A[:,x])
Using the same philosophy as in this post, here's a vectorized approach -
m = A.shape[1]
n = A.max()+1
A1 = A + (n*np.arange(m))
out = np.bincount(A1.ravel(),minlength=n*m).reshape(m,-1).T
I would suggest to use np.apply_along_axis, which will allow you to apply a 1D-method (in this case np.bincount) to 1D slices of a higher dimensional array:
import numpy as np
states = 4
rows = 8
cols = 4
A = np.random.randint(0,states,(rows,cols))
B = np.zeros((states,cols))
B = np.apply_along_axis(np.bincount, axis=0, arr=A)
You'll have to be careful, though. This (as well as your suggested for-loop) only works if the output of np.bincount has the right shape. If the maximum state is not present in one or multiple columns of your array A, the output will not have a smaller dimensionality and thus, the code will file with a ValueError.
This solution using the numpy_indexed package (disclaimer: I am its author) is fully vectorized, thus does not include any python loops behind the scenes. Also, there are no restrictions on the input; not every column needs to contain the same set of unique values.
import numpy_indexed as npi
rowidx, colidx = np.indices(A.shape)
(bin, col), B = npi.count_table(A.flatten(), colidx.flatten())
This gives an alternative (sparse) representation of the same result, which may be much more appropriate if the B array does indeed contain many zeros:
(bin, col), count = npi.count((A.flatten(), colidx.flatten()))
Note that apply_along_axis is just syntactic sugar for a for-loop, and has the same performance characteristics.
Yet another possibility:
import numpy as np
def bincount_columns(x, minlength=None):
nbins = x.max() + 1
if minlength is not None:
nbins = max(nbins, minlength)
ncols = x.shape[1]
count = np.zeros((nbins, ncols), dtype=int)
colidx = np.arange(ncols)[None, :]
np.add.at(count, (x, colidx), 1)
return count
For example,
In [110]: x
Out[110]:
array([[4, 2, 2, 3],
[4, 3, 4, 4],
[4, 3, 4, 4],
[0, 2, 4, 0],
[4, 1, 2, 1],
[4, 2, 4, 3]])
In [111]: bincount_columns(x)
Out[111]:
array([[1, 0, 0, 1],
[0, 1, 0, 1],
[0, 3, 2, 0],
[0, 2, 0, 2],
[5, 0, 4, 2]])
In [112]: bincount_columns(x, minlength=7)
Out[112]:
array([[1, 0, 0, 1],
[0, 1, 0, 1],
[0, 3, 2, 0],
[0, 2, 0, 2],
[5, 0, 4, 2],
[0, 0, 0, 0],
[0, 0, 0, 0]])

Ignoring duplicate entries in sparse matrix

I've tried to initialize csc_matrix and csr_matrix from a list of (data, (rows, cols)) values as the documentation suggests.
sparse = csc_matrix((data, (rows, cols)), shape=(n, n))
The problem is that, the method that I actually have for generating the data, rows and cols vectors introduces duplicates for some points. By default, scipy adds the values of the duplicate entries. However, in my case, those duplicates have exactly the same value in data for a given (row, col).
What I'm trying to achieve is to make scipy ignore the second entry if already exists one, instead of adding them.
Ignoring the fact that I could improve the generation algorithm to avoid generating duplicates, is there a parameter or another way of creating a sparse matrix that ignores duplicates?
Currently two entries with data = [4, 4]; cols = [1, 1]; rows = [1, 1]; generate a sparse matrix which value at (1,1) is 8 while the desired value is 4.
>>> c = csc_matrix(([4, 4], ([1,1],[1,1])), shape=(3,3))
>>> c.todense()
matrix([[0, 0, 0],
[0, 8, 0],
[0, 0, 0]])
I'm also aware that I could filter them by using a 2-dimensional numpy unique function, but lists are quite large so this is not really a valid option.
Other possible answer to the question: Is there any way of specifying what to do with duplicates? i.e. keeping the min or max instead of the default sum?
Creating an intermediary dok matrix works in your example:
In [410]: c=sparse.coo_matrix((data, (cols, rows)),shape=(3,3)).todok().tocsc()
In [411]: c.A
Out[411]:
array([[0, 0, 0],
[0, 4, 0],
[0, 0, 0]], dtype=int32)
A coo matrix puts your input arrays into its data,col,row attributes without change. The summing doesn't occur until it is converted to a csc.
todok loads the dictionary directly from the coo attributes. It creates the blank dok matrix, and fills it with:
dok.update(izip(izip(self.row,self.col),self.data))
So if there are duplicate (row,col) values, it's the last one that remains. This uses the standard Python dictionary hashing to find the unique keys.
Here's a way of using np.unique. I had to construct a special object array, because unique operates on 1d, and we have a 2d indexing.
In [479]: data, cols, rows = [np.array(j) for j in [[1,4,2,4,1],[0,1,1,1,2],[0,1,2,1,1]]]
In [480]: x=np.zeros(cols.shape,dtype=object)
In [481]: x[:]=list(zip(rows,cols))
In [482]: x
Out[482]: array([(0, 0), (1, 1), (2, 1), (1, 1), (1, 2)], dtype=object)
In [483]: i=np.unique(x,return_index=True)[1]
In [484]: i
Out[484]: array([0, 1, 4, 2], dtype=int32)
In [485]: c1=sparse.csc_matrix((data[i],(cols[i],rows[i])),shape=(3,3))
In [486]: c1.A
Out[486]:
array([[1, 0, 0],
[0, 4, 2],
[0, 1, 0]], dtype=int32)
I have no idea which approach is faster.
An alternative way of getting the unique index, as per liuengo's link:
rc = np.vstack([rows,cols]).T.copy()
dt = rc.dtype.descr * 2
i = np.unique(rc.view(dt), return_index=True)[1]
rc has to own its own data in order to change the dtype with view, hence the .T.copy().
In [554]: rc.view(dt)
Out[554]:
array([[(0, 0)],
[(1, 1)],
[(2, 1)],
[(1, 1)],
[(1, 2)]],
dtype=[('f0', '<i4'), ('f1', '<i4')])
Since the values in your data at repeating (row, col) are the same, you can get the unique rows, columns and values as follows:
rows, cols, data = zip(*set(zip(rows, cols, data)))
Example:
data = [4, 3, 4]
cols = [1, 2, 1]
rows = [1, 3, 1]
csc_matrix((data, (rows, cols)), shape=(4, 4)).todense()
matrix([[0, 0, 0, 0],
[0, 8, 0, 0],
[0, 0, 0, 0],
[0, 0, 3, 0]])
rows, cols, data = zip(*set(zip(rows, cols, data)))
csc_matrix((data, (rows, cols)), shape=(4, 4)).todense()
matrix([[0, 0, 0, 0],
[0, 4, 0, 0],
[0, 0, 0, 0],
[0, 0, 3, 0]])
Just to update hpaulj's answer to the most recent version of SciPy, the simplest solution to this problem is now, given a COO matrix c now:
dok=sparse.dok_matrix((c.shape),dtype=c.dtype)
dok._update(zip(zip(c.row,c.col),c.data))
new_c = dok.tocsc()
This is due to a new wrapper in the dok update() function, preventing it from direct changes to the array, requiring the use of the underscore to bypass the wrapper.

How does the axis parameter from NumPy work?

Can someone explain exactly what the axis parameter in NumPy does?
I am terribly confused.
I'm trying to use the function myArray.sum(axis=num)
At first I thought if the array is itself 3 dimensions, axis=0 will return three elements, consisting of the sum of all nested items in that same position. If each dimension contained five dimensions, I expected axis=1 to return a result of five items, and so on.
However this is not the case, and the documentation does not do a good job helping me out (they use a 3x3x3 array so it's hard to tell what's happening)
Here's what I did:
>>> e
array([[[1, 0],
[0, 0]],
[[1, 1],
[1, 0]],
[[1, 0],
[0, 1]]])
>>> e.sum(axis = 0)
array([[3, 1],
[1, 1]])
>>> e.sum(axis=1)
array([[1, 0],
[2, 1],
[1, 1]])
>>> e.sum(axis=2)
array([[1, 0],
[2, 1],
[1, 1]])
>>>
Clearly the result is not intuitive.
Clearly,
e.shape == (3, 2, 2)
Sum over an axis is a reduction operation so the specified axis disappears. Hence,
e.sum(axis=0).shape == (2, 2)
e.sum(axis=1).shape == (3, 2)
e.sum(axis=2).shape == (3, 2)
Intuitively, we are "squashing" the array along the chosen axis, and summing the numbers that get squashed together.
To understand the axis intuitively, refer the picture below (source: Physics Dept, Cornell Uni)
The shape of the (boolean) array in the above figure is shape=(8, 3). ndarray.shape will return a tuple where the entries correspond to the length of the particular dimension. In our example, 8 corresponds to length of axis 0 whereas 3 corresponds to length of axis 1.
If someone need this visual description:
There are good answers for visualization however it might help to think purely from analytical perspective.
You can create array of arbitrary dimension with numpy.
For example, here's a 5-dimension array:
>>> a = np.random.rand(2, 3, 4, 5, 6)
>>> a.shape
(2, 3, 4, 5, 6)
You can access any element of this array by specifying indices. For example, here's the first element of this array:
>>> a[0, 0, 0, 0, 0]
0.0038908603263844155
Now if you take out one of the dimensions, you get number of elements in that dimension:
>>> a[0, 0, :, 0, 0]
array([0.00389086, 0.27394775, 0.26565889, 0.62125279])
When you apply a function like sum with axis parameter, that dimension gets eliminated and array of dimension less than original gets created. For each cell in new array, the operator will get list of elements and apply the reduction function to get a scaler.
>>> np.sum(a, axis=2).shape
(2, 3, 5, 6)
Now you can check that the first element of this array is sum of above elements:
>>> np.sum(a, axis=2)[0, 0, 0, 0]
1.1647502999560164
>>> a[0, 0, :, 0, 0].sum()
1.1647502999560164
The axis=None has special meaning to flatten out the array and apply function on all numbers.
Now you can think about more complex cases where axis is not just number but a tuple:
>>> np.sum(a, axis=(2,3)).shape
(2, 3, 6)
Note that we use same technique to figure out how this reduction was done:
>>> np.sum(a, axis=(2,3))[0,0,0]
7.889432081931909
>>> a[0, 0, :, :, 0].sum()
7.88943208193191
You can also use same reasoning for adding dimension in array instead of reducing dimension:
>>> x = np.random.rand(3, 4)
>>> y = np.random.rand(3, 4)
# New dimension is created on specified axis
>>> np.stack([x, y], axis=2).shape
(3, 4, 2)
>>> np.stack([x, y], axis=0).shape
(2, 3, 4)
# To retrieve item i in stack set i in that axis
Hope this gives you generic and full understanding of this important parameter.
Some answers are too specific or do not address the main source of confusion. This answer attempts to provide a more general but simple explanation of the concept, with a simple example.
The main source of confusion is related to expressions such as "Axis along which the means are computed", which is the documentation of the argument axis of the numpy.mean function. What the heck does "along which" even mean here? "Along which" essentially means that you will sum the rows (and divide by the number of rows, given that we are computing the mean), if the axis is 0, and the columns, if the axis is 1. In the case of axis is 0 (or 1), the rows can be scalars or vectors or even other multi-dimensional arrays.
In [1]: import numpy as np
In [2]: a=np.array([[1, 2], [3, 4]])
In [3]: a
Out[3]:
array([[1, 2],
[3, 4]])
In [4]: np.mean(a, axis=0)
Out[4]: array([2., 3.])
In [5]: np.mean(a, axis=1)
Out[5]: array([1.5, 3.5])
So, in the example above, np.mean(a, axis=0) returns array([2., 3.]) because (1 + 3)/2 = 2 and (2 + 4)/2 = 3. It returns an array of two numbers because it returns the mean of the rows for each column (and there are two columns).
Both 1st and 2nd reply is great for understanding ndarray concept in numpy. I am giving a simple example.
And according to this image by #debaonline4u
https://i.stack.imgur.com/O5hBF.jpg
Suppose , you have an 2D array -
[1, 2, 3]
[4, 5, 6]
In, numpy format it will be -
c = np.array([[1, 2, 3],
[4, 5, 6]])
Now,
c.ndim = 2 (rows/axis=0)
c.shape = (2,3) (axis0, axis1)
c.sum(axis=0) = [1+4, 2+5, 3+6] = [5, 7, 9] (sum of the 1st elements of each rows, so along axis0)
c.sum(axis=1) = [1+2+3, 4+5+6] = [6, 15] (sum of the elements in a row, so along axis1)
So for your 3D array,

Increment given indices in a matrix

Briefly: there is a similar question and the best answer suggests using numpy.bincount. I need the same thing, but for a matrix.
I've got two arrays:
array([1, 2, 1, 1, 2])
array([2, 1, 1, 1, 1])
together they make indices that should be incremented:
>>> np.array([a, b]).T
array([[1, 2],
[2, 1],
[1, 1],
[1, 1],
[2, 1]])
I want to get this matrix:
array([[0, 0, 0],
[0, 2, 1], # (1,1) twice, (1,2) once
[0, 2, 0]]) # (2,1) twice
The matrix will be small (like, 5×5), and the number of indices will be large (somewhere near 10^3 or 10^5).
So, is there anything better (faster) than a for-loop?
You can still use bincount(). The trick is to convert a and b into a single 1D array of flat indices.
If the matrix is nxm, you could apply bincount() to a * m + b, and construct the matrix from the result.
To take the example in your question:
In [15]: a = np.array([1, 2, 1, 1, 2])
In [16]: b = np.array([2, 1, 1, 1, 1])
In [17]: cnt = np.bincount(a * 3 + b)
In [18]: cnt.resize((3, 3))
In [19]: cnt
Out[19]:
array([[0, 0, 0],
[0, 2, 1],
[0, 2, 0]])
If the shape of the array is more complicated, it might be easier to use np.ravel_multi_index() instead of computing flat indices by hand:
In [20]: cnt = np.bincount(np.ravel_multi_index(np.vstack((a, b)), (3, 3)))
In [21]: np.resize(cnt, (3, 3))
Out[21]:
array([[0, 0, 0],
[0, 2, 1],
[0, 2, 0]])
(Hat tip #Jaime for pointing out ravel_multi_index.)
m1 = m.view(numpy.ndarray) # Create view
m1.shape = -1 # Make one-dimensional array
m1 += np.bincount(a+m.shape[1]*b, minlength=m1.size)

Categories