Pandas/Numpy Get matrix from column of arrays - python

I have a pandas dataframe with a column of lists.
df:
inputs
0 [1, 2, 3]
1 [4, 5, 6]
2 [7, 8, 9]
3 [10, 11, 12]
I need the matrix
array([[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9],
[10, 11, 12]])
An efficient way to do this?
Note: When I try df.inputs.as_matrix() the output is
array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], dtype=object)
which has shape (4,), not (4,3) as desired.

You can convert the column to a list and then pass it to np.array. If all the lists in the column have the same length, this produces a 2D array:
arr = np.array(df.inputs.tolist())
#array([[ 1, 2, 3],
# [ 4, 5, 6],
# [ 7, 8, 9],
# [10, 11, 12]])
arr.shape
# (4, 3)
Another option, as commented by @piRSquared, is to use .values to access the underlying NumPy array first and then convert it to a list. This is marginally faster with the example given:
%timeit df.inputs.values.tolist()
# 100000 loops, best of 3: 5.52 µs per loop
%timeit df.inputs.tolist()
# 100000 loops, best of 3: 11.5 µs per loop
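As a minimal sketch of the conversion above: the frame below is rebuilt from the question's example, and the np.stack variant is an alternative that does not appear in the answer. Both assume every list in the column has the same length.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({"inputs": [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]})

# Approach from the answer: list of lists -> 2D array.
arr = np.array(df["inputs"].tolist())

# Alternative (an assumption, not from the answer): stack the row lists.
stacked = np.stack(df["inputs"].to_numpy())

print(arr.shape)                     # (4, 3)
print(np.array_equal(arr, stacked))  # True
```

Either result is a regular numeric array rather than an object array of lists, which is what gives it the (4, 3) shape.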


numpy array - efficiently subtract each row of B from A

I have two numpy arrays a and b. I want to subtract each row of b from a. I tried to use:
a1 - b1[:, None]
This works for small arrays, but takes too long when it comes to real world data sizes.
a = np.arange(16).reshape(8,2)
a
Out[35]:
array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11],
[12, 13],
[14, 15]])
b = np.arange(6).reshape(3,2)
b
Out[37]:
array([[0, 1],
[2, 3],
[4, 5]])
a - b[:, None]
Out[38]:
array([[[ 0, 0],
[ 2, 2],
[ 4, 4],
[ 6, 6],
[ 8, 8],
[10, 10],
[12, 12],
[14, 14]],
[[-2, -2],
[ 0, 0],
[ 2, 2],
[ 4, 4],
[ 6, 6],
[ 8, 8],
[10, 10],
[12, 12]],
[[-4, -4],
[-2, -2],
[ 0, 0],
[ 2, 2],
[ 4, 4],
[ 6, 6],
[ 8, 8],
[10, 10]]])
%%timeit
a - b[:, None]
The slowest run took 10.36 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.18 µs per loop
This approach is too slow / inefficient for larger arrays.
a1 = np.arange(18900 * 41).reshape(18900, 41)
b1 = np.arange(2674 * 41).reshape(2674, 41)
%%timeit
a1 - b1[:, None]
1 loop, best of 3: 12.1 s per loop
%%timeit
for index in range(len(b1)):
    a1 - b1[index]
1 loop, best of 3: 2.35 s per loop
Is there any numpy trick I can use to speed this up?
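You are running into memory limits: the broadcast result a1 - b1[:, None] has shape (2674, 18900, 41), about 2.07 billion elements, which takes roughly 16.6 GB with 8-byte integers.
If, as in your examples, 8 bits are sufficient to store the data, use uint8: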
import numpy as np
a1 = np.arange(18900 * 41,dtype=np.uint8).reshape(18900, 41)
b1 = np.arange(2674 * 41,dtype=np.uint8).reshape(2674, 41)
%time c1=(a1-b1[:,None])
#1.02 s
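To make the memory argument concrete, here is a small sketch that estimates the size of the broadcast result before computing it. The helper name broadcast_result_nbytes is hypothetical, and it assumes the result keeps the dtype of the inputs.

```python
import math
import numpy as np

def broadcast_result_nbytes(a_shape, b_shape, dtype):
    """Bytes needed for a - b[:, None], given the shapes of a and b."""
    # b[:, None] has shape (rows_b, 1, cols), so the result is (rows_b, rows_a, cols).
    out_shape = (b_shape[0],) + tuple(a_shape)
    return math.prod(out_shape) * np.dtype(dtype).itemsize

a_shape, b_shape = (18900, 41), (2674, 41)
print(broadcast_result_nbytes(a_shape, b_shape, np.int64) / 1e9)  # about 16.6 (GB)
print(broadcast_result_nbytes(a_shape, b_shape, np.uint8) / 1e9)  # about 2.07 (GB)
```

The eightfold difference in itemsize is the whole saving here; whether uint8 is actually safe depends on the range of values in the real data.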

Delete and duplicate rows in numpy array

In Python, let's say I have a 1366x768 numpy array. And I want to delete each second row from it (0th row remains, 1st removed, 2nd remains, 3rd removed.. and so on), and replace the empty space with a duplicate from the row which was before it (the undeleted row) at the same time.
Is it possible in numpy?
One approach -
a[::2].repeat(2,axis=0)
To make the changes in the array, assign it back.
Sample run -
In [105]: a
Out[105]:
array([[2, 5, 1, 1],
[2, 0, 2, 5],
[1, 1, 5, 7],
[0, 7, 1, 8],
[8, 5, 2, 3],
[2, 1, 0, 6],
[5, 6, 1, 6],
[7, 1, 4, 7],
[3, 8, 1, 4],
[5, 8, 8, 8]])
In [106]: a[::2].repeat(2,axis=0)
Out[106]:
array([[2, 5, 1, 1],
[2, 5, 1, 1],
[1, 1, 5, 7],
[1, 1, 5, 7],
[8, 5, 2, 3],
[8, 5, 2, 3],
[5, 6, 1, 6],
[5, 6, 1, 6],
[3, 8, 1, 4],
[3, 8, 1, 4]])
If we care about performance, here's another approach using NumPy strides -
def strided_app(a):
    m0,n0 = a.strides
    m,n = a.shape
    strided = np.lib.stride_tricks.as_strided
    return strided(a,shape=(m//2,2,n),strides=(2*m0,0,n0)).reshape(-1,n)
Sample run -
In [154]: a
Out[154]:
array([[4, 8, 7, 7],
[5, 5, 1, 7],
[1, 8, 1, 3],
[6, 6, 5, 6],
[0, 2, 6, 3],
[6, 6, 8, 7],
[7, 6, 8, 1],
[7, 8, 8, 2],
[4, 0, 2, 8],
[5, 8, 1, 4]])
In [155]: strided_app(a)
Out[155]:
array([[4, 8, 7, 7],
[4, 8, 7, 7],
[1, 8, 1, 3],
[1, 8, 1, 3],
[0, 2, 6, 3],
[0, 2, 6, 3],
[7, 6, 8, 1],
[7, 6, 8, 1],
[4, 0, 2, 8],
[4, 0, 2, 8]])
Timings -
In [156]: arr = np.arange(1000000).reshape(1000, 1000)
# Proposed soln-1
In [157]: %timeit arr[::2].repeat(2,axis=0)
1000 loops, best of 3: 1.26 ms per loop
# @Psidom's soln
In [158]: %timeit arr[1::2] = arr[::2]
1000 loops, best of 3: 928 µs per loop
In [159]: arr = np.arange(1000000).reshape(1000, 1000)
# Proposed soln-2
In [160]: %timeit strided_app(arr)
1000 loops, best of 3: 830 µs per loop
Looks like you have an even number of rows, in which case you can use assignment (copy each even-indexed row into the odd-indexed row that follows it):
arr = np.array([[1,4],[3,1],[2,3],[2,2]])
arr[1::2] = arr[::2]
arr
#array([[1, 4],
# [1, 4],
# [2, 3],
# [2, 3]])
This avoids copying the entire array, but it doesn't work if the array has an odd number of rows.
Timing: here is a comparison of the timings; the assignment does seem faster.
arr = np.arange(1000000).reshape(1000, 1000)
%timeit arr[::2].repeat(2,axis=0)
1000 loops, best of 3: 913 µs per loop
%timeit arr[1::2] = arr[::2]
1000 loops, best of 3: 655 µs per loop
This works for both an even and an odd number of rows:
for i in range(1,len(a),2):
    a[i] = a[i-1]
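A vectorized variant that also tolerates an odd number of rows can be sketched by combining the repeat approach with a final trim. The function name duplicate_even_rows is hypothetical, and unlike the loop it returns a new array instead of modifying a in place.

```python
import numpy as np

def duplicate_even_rows(a):
    """Return a copy in which each odd-indexed row repeats the row before it."""
    # a[::2] keeps rows 0, 2, 4, ...; repeat doubles each of them;
    # the trim drops the one surplus row when len(a) is odd.
    return a[::2].repeat(2, axis=0)[:len(a)]

a = np.arange(10).reshape(5, 2)  # odd number of rows
out = duplicate_even_rows(a)
print(out.shape)  # (5, 2)
```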

Numpy - create matrix with rows of vector

I have a vector [x,y,z,q] and I want to create a matrix:
[[x,y,z,q],
[x,y,z,q],
[x,y,z,q],
...
[x,y,z,q]]
with m rows. I think this could be done in some smart way, using broadcasting, but I can only think of doing it with a for loop.
This is certainly possible with broadcasting, by adding the vector to a column of m zeros, like so -
np.zeros((m,1),dtype=vector.dtype) + vector
NumPy also has a built-in function, np.tile, for exactly this task -
np.tile(vector,(m,1))
Sample run -
In [496]: vector
Out[496]: array([4, 5, 8, 2])
In [497]: m = 5
In [498]: np.zeros((m,1),dtype=vector.dtype) + vector
Out[498]:
array([[4, 5, 8, 2],
[4, 5, 8, 2],
[4, 5, 8, 2],
[4, 5, 8, 2],
[4, 5, 8, 2]])
In [499]: np.tile(vector,(m,1))
Out[499]:
array([[4, 5, 8, 2],
[4, 5, 8, 2],
[4, 5, 8, 2],
[4, 5, 8, 2],
[4, 5, 8, 2]])
You can also use np.repeat after extending its dimension with np.newaxis/None for the same effect, like so -
In [510]: np.repeat(vector[None],m,axis=0)
Out[510]:
array([[4, 5, 8, 2],
[4, 5, 8, 2],
[4, 5, 8, 2],
[4, 5, 8, 2],
[4, 5, 8, 2]])
You can also use integer array indexing to get the replications, like so -
In [525]: vector[None][np.zeros(m,dtype=int)]
Out[525]:
array([[4, 5, 8, 2],
[4, 5, 8, 2],
[4, 5, 8, 2],
[4, 5, 8, 2],
[4, 5, 8, 2]])
And finally, np.broadcast_to simply creates a read-only 2D view into the input vector, so it is virtually free and needs no extra memory. So, we would simply do -
In [22]: np.broadcast_to(vector,(m,len(vector)))
Out[22]:
array([[4, 5, 8, 2],
[4, 5, 8, 2],
[4, 5, 8, 2],
[4, 5, 8, 2],
[4, 5, 8, 2]])
Runtime test -
Here's a quick runtime test comparing the various approaches -
In [12]: vector = np.random.rand(10000)
In [13]: m = 10000
In [14]: %timeit np.broadcast_to(vector,(m,len(vector)))
100000 loops, best of 3: 3.4 µs per loop # virtually free!
In [15]: %timeit np.zeros((m,1),dtype=vector.dtype) + vector
10 loops, best of 3: 95.1 ms per loop
In [16]: %timeit np.tile(vector,(m,1))
10 loops, best of 3: 89.7 ms per loop
In [17]: %timeit np.repeat(vector[None],m,axis=0)
10 loops, best of 3: 86.2 ms per loop
In [18]: %timeit vector[None][np.zeros(m,dtype=int)]
10 loops, best of 3: 89.8 ms per loop
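The reason np.broadcast_to is so cheap can be checked directly. The sketch below reuses the sample vector from above; the variable names view and writable are chosen only for illustration.

```python
import numpy as np

vector = np.array([4, 5, 8, 2])
m = 5

# A view with a row stride of 0: every "row" points at the same memory.
view = np.broadcast_to(vector, (m, vector.size))
print(view.strides[0])                 # 0
print(np.shares_memory(view, vector))  # True
print(view.flags.writeable)            # False

# Take a copy if an independent, writable matrix is needed.
writable = view.copy()
writable[0, 0] = 99
```

The copy costs about as much as np.tile, so the view is only a saving when the matrix is read and never modified.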

Quick way to upsample numpy array by nearest neighbor tiling [duplicate]

I have a 2D array of integers that is MxN, and I would like to expand the array to (BM)x(BN) where B is the length of a square tile side thus each element of the input array is repeated as a BxB block in the final array. Below is an example with a nested for loop. Is there a quicker/builtin way?
import numpy as np
a = np.arange(9).reshape([3,3]) # input array - 3x3
B = 2 # block size - 2 (an integer, so it can be used in shapes and slices)
A = np.zeros([a.shape[0]*B,a.shape[1]*B]) # output array - 6x6
# Loop, filling A with tiled values of a at each index
for i,l in enumerate(a): # lines in a
    for j,aij in enumerate(l): # a[i,j]
        A[B*i:B*(i+1),B*j:B*(j+1)] = aij
Result ...
a= [[0 1 2]
[3 4 5]
[6 7 8]]
A = [[ 0. 0. 1. 1. 2. 2.]
[ 0. 0. 1. 1. 2. 2.]
[ 3. 3. 4. 4. 5. 5.]
[ 3. 3. 4. 4. 5. 5.]
[ 6. 6. 7. 7. 8. 8.]
[ 6. 6. 7. 7. 8. 8.]]
One option is
>>> a.repeat(2, axis=0).repeat(2, axis=1)
array([[0, 0, 1, 1, 2, 2],
[0, 0, 1, 1, 2, 2],
[3, 3, 4, 4, 5, 5],
[3, 3, 4, 4, 5, 5],
[6, 6, 7, 7, 8, 8],
[6, 6, 7, 7, 8, 8]])
This is slightly wasteful due to the intermediate array but it's concise at least.
Here's a potentially fast way using stride tricks and reshaping:
from numpy.lib.stride_tricks import as_strided
def tile_array(a, b0, b1):
    r, c = a.shape # number of rows/columns
    rs, cs = a.strides # row/column strides
    x = as_strided(a, (r, b0, c, b1), (rs, 0, cs, 0)) # view a as larger 4D array
    return x.reshape(r*b0, c*b1) # create new 2D array
The underlying data in a is copied when reshape is called, so this function does not return a view. However, compared to using repeat along multiple axes, fewer copying operations are required.
The function can be then used as follows:
>>> a = np.arange(9).reshape(3, 3)
>>> a
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> tile_array(a, 2, 2)
array([[0, 0, 1, 1, 2, 2],
[0, 0, 1, 1, 2, 2],
[3, 3, 4, 4, 5, 5],
[3, 3, 4, 4, 5, 5],
[6, 6, 7, 7, 8, 8],
[6, 6, 7, 7, 8, 8]])
>>> tile_array(a, 3, 4)
array([[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
[3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5],
[3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5],
[3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5],
[6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8],
[6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8],
[6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8]])
Now, for small blocks, this method is a little slower than using repeat but faster than kron.
For slightly larger blocks, however, it becomes quicker than other alternatives. For instance, using a block shape of (20, 20):
>>> %timeit tile_array(a, 20, 20)
100000 loops, best of 3: 18.7 µs per loop
>>> %timeit a.repeat(20, axis=0).repeat(20, axis=1)
10000 loops, best of 3: 26 µs per loop
>>> %timeit np.kron(a, np.ones((20,20), a.dtype))
10000 loops, best of 3: 106 µs per loop
The gap between the methods increases as the block size increases.
Also if a is a large array, it may be quicker than alternatives:
>>> a2 = np.arange(1000000).reshape(1000, 1000)
>>> %timeit tile_array(a2, 2, 2)
100 loops, best of 3: 11.4 ms per loop
>>> %timeit a2.repeat(2, axis=0).repeat(2, axis=1)
1 loops, best of 3: 30.9 ms per loop
Probably not the fastest, but..
np.kron(a, np.ones((B,B), a.dtype))
It does the Kronecker product, so it involves a multiplication for each element in the output.
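The same 4D-view idea can also be sketched with np.broadcast_to instead of as_strided, which avoids computing strides by hand. This variant does not appear in the answers above, the function name tile_broadcast is hypothetical, and no timing claim is made for it.

```python
import numpy as np

def tile_broadcast(a, b0, b1):
    """Repeat each element of 2D array a as a b0 x b1 block."""
    r, c = a.shape
    # Insert two length-1 axes, then broadcast them to the block size.
    expanded = np.broadcast_to(a[:, None, :, None], (r, b0, c, b1))
    # Reshaping the broadcast view copies the data into a new 2D array.
    return expanded.reshape(r * b0, c * b1)

a = np.arange(9).reshape(3, 3)
print(tile_broadcast(a, 2, 2))
```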

Difference between A[1:3][0:2] and A[1:3,0:2]

I can't figure out the difference between these two kinds of indexing. It seems like they should produce the same results but they do not. Any explanation?
A[1:3, 0:2] takes rows 1 and 2 and columns 0 and 1 (the end of each slice is exclusive), thus returning a 2x2 array.
A[1:3][0:2] first takes rows 1 and 2 and then, from this subarray, takes its rows 0 and 1, resulting in a 2xn array where n is the original number of columns.
In [1]: import numpy as np
In [2]: a = np.arange(16).reshape(4,4)
In [3]: a
Out[3]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
In [4]: a[1:3,0:2]
Out[4]:
array([[4, 5],
[8, 9]])
In [5]: a[1:3]
Out[5]:
array([[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [6]: a[1:3][0:2]
Out[6]:
array([[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
The equivalent of A[1:3,0:2] using two [] is: A[1:3][:,0:2]:
In [7]: a[1:3][:,0:2]
Out[7]:
array([[4, 5],
[8, 9]])
Here : means "all the rows". So you first select the rows via [1:3] and then, from all of those rows, select columns 0 and 1.
A[1:3][0:2] means first apply [1:3] to A, and then apply [0:2] to the array returned by the first step, so both slices are applied only to the rows. On the other hand, A[1:3, 0:2] means apply 1:3 to the rows and 0:2 to the columns, i.e. get only the second and third rows and only the first two columns of those rows.
>>> import numpy as np
>>> a = np.arange(12).reshape(3, 4)
>>> a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> a[1:3][0:2]
array([[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> a[1:3] #Get 2nd and 3rd row.
array([[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> _[0:2] #Get the first two rows of the last array.
array([[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> a[1:3, 0:2]
array([[4, 5],
[8, 9]])
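The difference can be confirmed by comparing shapes. This short sketch uses the 4x4 array from the first answer; the names chained, joint and two_step are only for illustration.

```python
import numpy as np

a = np.arange(16).reshape(4, 4)

chained = a[1:3][0:2]      # both slices act on the rows
joint = a[1:3, 0:2]        # rows 1-2, columns 0-1
two_step = a[1:3][:, 0:2]  # chained form that does slice the columns

print(chained.shape)                    # (2, 4)
print(joint.shape)                      # (2, 2)
print(np.array_equal(joint, two_step))  # True
```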
