I have three lists (really columns in a pandas dataframe) one with data of interest, one with x array coordinates, and one with y array coordinates. All lists are the same length and their order in the list associated with the coordinates (so L1: "Apple" coincides with L2:"1", and L3:"A"). I would like to make an array with the dimensions provided by the two coordinate lists with data from the data list. What is the best way to do this?
The expected output would be in the form of a numpy array or something like:
array = [[0,0,0,3,0,0,2,3][0,0,0,0,0,0,0,3]] #databased on below
Where in this example the array has the dimensions of y = 2 from y.unique() and x = 8 from x.unique().
The following is example input data for what I am talking about:
array_x
array_y
Data
1
a
0
2
a
0
3
a
0
4
a
3
5
a
0
6
a
0
7
a
2
8
a
3
1
b
0
2
b
0
3
b
0
4
b
0
5
b
0
6
b
0
7
b
0
8
b
3
You may be looking for pivot:
out = df.pivot(values=['Data'], columns=['array_y'], index=['array_x']).to_numpy()
Output:
array([[0, 0],
[0, 0],
[0, 0],
[3, 0],
[0, 0],
[0, 0],
[2, 0],
[3, 3]], dtype=int64)
Supposing you have a dataframe like that:
import pandas as pd
import numpy as np
myDataframe = pd.DataFrame([[1,2],[3,4],[5,6]], columns=['x','y'])
Then you can select the columns you want and creat an array from it
my_array = np.array(myDataframe[['x','y']])
>>> my_array
array([[1, 2],
[3, 4],
[5, 6]], dtype=int64)
You could do a zip (note: I'm shorthand-ing some of your example data):
data_x = [1, 2, 3, 4, 5, 6, 7, 8] * 2
data_y = ['a'] * 8 + ['b'] * 8
data_vals = [0,0,0,3,0,0,2,3,0,0,0,0,0,0,0,3]
coll = dict()
for (x, y, val) in zip(data_x, data_y, data_vals):
if coll.get(y) is None:
coll[y] = []
if x > len(coll[y]):
coll[y].extend([0] * (x - len(coll[y])))
coll[y][x - 1] = val
result = []
for k in sorted(coll):
result.append(coll[k])
print coll
print result
Output:
{'a': [0, 0, 0, 3, 0, 0, 2, 3], 'b': [0, 0, 0, 0, 0, 0, 0, 3]}
[[0, 0, 0, 3, 0, 0, 2, 3], [0, 0, 0, 0, 0, 0, 0, 3]]
I want to merge an two equal element in an array, let say I am having an array like this
np.array([[0,1,1,2,2],
[0,1,1,2,2],
[0,2,2,2,2]])
I want to produce something like this if I am directing it right
np.array([[0,0,2,0,4],
[0,0,2,0,4],
[0,0,4,0,4]])
And this if I am moving it up
np.array([[0,2,2,4,4],
[0,0,0,0,0],
[0,2,2,2,2]])
My current code simply loops through a normal list
for i in range(4):
for j in range(3):
if mat[i][j]==matrix[i][j+1] and matrix[i][j]!=0:
matrix[i][j]*=2
matrix[i][j+1]=0
I prefer numpy and absence of loops if possible
This task is deceptively difficult to do without loops! You'll need a bunch of high-level numpy tricks to make it work. I sort of fly through them here, but I will try to link to other resources where I can.
From here, the best way to do row-wise comparison is:
a = np.array([[0,1,1,2,2],
[0,1,1,2,2],
[0,2,2,2,2]])
b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
b
array([[[0 0 0 0 1 0 0 0 1 0 0 0 2 0 0 0 2 0 0 0]],
[[0 0 0 0 1 0 0 0 1 0 0 0 2 0 0 0 2 0 0 0]],
[[0 0 0 0 2 0 0 0 2 0 0 0 2 0 0 0 2 0 0 0]]],
dtype='|V20')
b.shape
(3, 1)
Notice that the innermost brackets are not an additional dimension, but an np.void object that can be compared with things like np.unique.
Still, getting the indices you want to keep isn't really easy, but here's the one-liner:
c = np.flatnonzero(np.r_[1, np.diff(np.unique(b, return_inverse = 1)[1])])
Eech. It's kinda messy. Basically you're looking for the indices where the lines change, and the first line. Normally you wouldn't need the np.unique call and could just do np.diff(b), but you can't subtract np.void. np.r_ is a shortcut for np.concatenate that's a bit more readable. And np.flatnonzero gives you the indices where your new array isn't zero (i.e. the indices you want to keep)
c
array([0, 2], dtype=int32)
There, now you can use some fancy ufunc.reduceat math to do your addition:
d = np.add.reduceat(a, c, axis = 0)
d
array([[0, 2, 2, 4, 4],
[0, 2, 2, 2, 2]], dtype=int32)
OK, now to add the zeros, we'll just plug that into an np.zero array using advanced indexing
e = np.zeros_like(a)
e[c] = d
e
array([[0, 2, 2, 4, 4],
[0, 0, 0, 0, 0],
[0, 2, 2, 2, 2]])
And there we go! You can go in other directions by transposing or flipping the matrix at the beginning and the end.
def reduce_duplicates(a):
b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
c = np.flatnonzero(np.r_[1, np.diff(np.unique(b, return_inverse = 1)[1])])
d = np.add.reduceat(a, c, axis = 0)
e = np.zeros_like(a)
e[c] = d
return e
reduce_duplicates(a.T[::-1,:])[::-1,:].T #reducing right
array([[0, 0, 2, 0, 4],
[0, 0, 2, 0, 4],
[0, 0, 4, 0, 4]])
I don't have numba so I can't test speed against the other suggestion (knowing numba it is probably slower), but it is loopless and numpy.
A "vectorized" version of your function would be pretty messy, since the merges can happen at both even or odd indexes in each row/column, depending on preceding values in that row/column.
To illustrate, see how this vectorized version works on your (horizontal) example which happens to have all merges land on odd indexes:
>>> x
array([[0, 1, 1, 2, 2],
[0, 1, 1, 2, 2],
[0, 2, 2, 2, 2]])
>>> y=x==np.roll(x, 1, axis=1); y[:,1::2]=False; x*y*2
array([[0, 0, 2, 0, 4],
[0, 0, 2, 0, 4],
[0, 0, 4, 0, 4]])
But if I shift one of the rows by 1, it no longer works:
>>> x2
array([[0, 1, 1, 2, 2],
[0, 0, 1, 1, 2],
[0, 2, 2, 2, 2]])
>>> y=x2==np.roll(x2, 1, axis=1); y[:,1::2]=False; x2*y*2
array([[0, 0, 2, 0, 4],
[0, 0, 0, 0, 0],
[0, 0, 4, 0, 4]])
I'm not sure what strategy I would take next, if it is possible to implement this in a vectorized fashion, but it wouldn't be very clean.
I would suggest using numba for something like this. It will keep your code readable and should make it faster. Just add the #jit decorator to your function and evaluate how much it improves performance.
EDIT: I did some timing for you. Also there is a small fix to your function to make it coincide with your example.
>>> def foo(matrix):
... for i in range(matrix.shape[0]):
... for j in range(matrix.shape[1]-1):
... if matrix[i][j]==matrix[i][j+1] and matrix[i][j]!=0:
... matrix[i][j+1]*=2
... matrix[i][j]=0
...
>>> from numba import jit
>>> #jit
... def foo2(matrix):
... for i in range(matrix.shape[0]):
... for j in range(matrix.shape[1]-1):
... if matrix[i][j]==matrix[i][j+1] and matrix[i][j]!=0:
... matrix[i][j+1]*=2
... matrix[i][j]=0
...
>>> import time
>>> z=np.random.random((1000,1000)); start=time.time(); foo(z); print(time.time()-start)
1.0277159214
>>> z=np.random.random((1000,1000)); start=time.time(); foo2(z); print(time.time()-start)
0.00354909896851
I have a matrix 'A' whose values are shown below. After creating a matrix 'B' of ones using numpy.ones and assigning the values from 'A' to 'B' by indexing 'i' rows and 'j' columns, the resulting 'B' matrix is retaining the first row of ones from the original 'B' matrix. I'm not sure why this is happening with the code provided below.
The resulting 'B' matrix from command line is shown below:
import numpy
import numpy as np
A = np.matrix([[8,8,8,7,7,6,8,2],
[8,8,7,7,7,6,6,7],
[1,8,8,7,7,6,6,6],
[1,1,8,7,7,6,7,7],
[1,1,1,1,8,7,7,6],
[1,1,2,1,8,7,7,6],
[2,2,2,1,1,8,7,7],
[2,1,2,1,1,8,8,7]])
B = np.ones((8,8),dtype=np.int)
for i in np.arange(1,9):
for j in np.arange(1,9):
B[i:j] = A[i:j]
C = np.zeros((6,6),dtype=np.int)
print C
D = np.matrix([[1,1,2,3,3,2,2,1],
[1,2,1,2,3,3,3,2],
[1,1,2,1,1,2,2,3],
[2,2,3,2,2,2,1,3],
[1,2,2,3,2,3,1,3],
[1,2,3,3,2,3,2,3],
[1,2,2,3,2,3,1,2],
[2,2,3,2,2,3,2,2]])
print D
for k in np.arange(2,8):
for l in np.arange(2,8):
B[k,l] # point in middle
b = B[(k-1),(l-1)]
if b == 8:
# Matrix C is smaller than Matrix B
C[(k-1),(l-1)] = C[(k-1),(l-1)] + 1*D[(k-1),(l-1)]
#Output for Matrix B
B=
[1,1,1,1,1,1,1,1],
[8,8,7,7,7,6,6,7],
[1,8,8,7,7,6,6,6],
[1,1,8,7,7,6,7,7],
[1,1,1,1,8,7,7,6],
[1,1,2,1,8,7,7,6],
[2,2,2,1,1,8,7,7],
[2,1,2,1,1,8,8,7]
Python starts counting at 0, so your code should work find if you replace np.arange(1,9) with np.arange(9)
In [11]: np.arange(1,9)
Out[11]: array([1, 2, 3, 4, 5, 6, 7, 8])
In [12]: np.arange(9)
Out[12]: array([0, 1, 2, 3, 4, 5, 6, 7, 8])
As stated above: python indices start at 0.
In order to iterate over some (say matrix) indices, you should use the builtin function 'range' and not 'numpy.arange'. The arange returns an ndarray, while range returns a generator in a recent python version.
The syntax 'B[i:j]' does not refer to the element at row i and column j in an array B. It rather means: all rows of B starting at row i and going up to (but not including) row j (if B has so many rows, otherwise it returns until includingly the last row). The element at position i, j is in fact 'B[i,j]'.
The indexing syntax of python / numpy is quite powerful and performant.
For one thing, as others have mentioned, NumPy uses 0-based indexing. But even once you fix that, this is not what you want to use:
for i in np.arange(9):
for j in np.arange(9):
B[i:j] = A[i:j]
The : indicates slicing, so i:j means "all items from the i-th, up to the j-th, excluding the last one." So your code is copying every row over several times, which is not a very efficient way of doing things.
You probable wanted to use ,:
for i in np.arange(8): # Notice the range only goes up to 8
for j in np.arange(8): # ditto
B[i, j] = A[i, j]
This will work, but is also pretty wasteful performancewise when using NumPy. A much faster approach is to simply ask for:
B[:] = A
Here first what I think you are trying to do, with minimal corrections, comments to your code:
import numpy as np
A = np.matrix([[8,8,8,7,7,6,8,2],
[8,8,7,7,7,6,6,7],
[1,8,8,7,7,6,6,6],
[1,1,8,7,7,6,7,7],
[1,1,1,1,8,7,7,6],
[1,1,2,1,8,7,7,6],
[2,2,2,1,1,8,7,7],
[2,1,2,1,1,8,8,7]])
B = np.ones((8,8),dtype=np.int)
for i in np.arange(1,9): # i= 1...8
for j in np.arange(1,9): # j= 1..8, but A[8,j] and A[j,8] do not exist,
# if you insist on 1-based indeces, numpy still expects 0... n-1,
# so you'll have to subtract 1 from each index to use them
B[i-1,j-1] = A[i-1,j-1]
C = np.zeros((6,6),dtype=np.int)
D = np.matrix([[1,1,2,3,3,2,2,1],
[1,2,1,2,3,3,3,2],
[1,1,2,1,1,2,2,3],
[2,2,3,2,2,2,1,3],
[1,2,2,3,2,3,1,3],
[1,2,3,3,2,3,2,3],
[1,2,2,3,2,3,1,2],
[2,2,3,2,2,3,2,2]])
for k in np.arange(2,8): # k = 2..7
for l in np.arange(2,8): # l = 2..7 ; matrix B has indeces 0..7, so if you want inner points, you'll need 1..6
b = B[k-1,l-1] # so this is correct, gives you the inner matrix
if b == 8: # here b is a value in the matrix , not the index, careful not to mix those
# Matrix C is smaller than Matrix B ; yes C has indeces from 0..5 for k and l
# so to address C you'll need to subtract 2 from the k,l that you defined in the for loop
C[k-2,l-2] = C[k-2,l-2] + 1*D[k-1,l-1]
print C
output:
[[2 0 0 0 0 0]
[1 2 0 0 0 0]
[0 3 0 0 0 0]
[0 0 0 2 0 0]
[0 0 0 2 0 0]
[0 0 0 0 3 0]]
But there are more elegant ways to do it. In particular look up slicing, ( numpy conditional array arithmetic, possibly scipy threshold.All of the below should be much faster than Python loops too (numpy loops are written in C).
B=np.copy(A) #if you need a copy of A, this is the way
# one quick way to make a matrix that's 1 whereever A==8, and is smaller
from scipy import stats
B1=stats.threshold(A, threshmin=8, threshmax=8, newval=0)/8 # make a matrix with ones where there is an 8
B1=B1[1:-1,1:-1]
print B1
#another quick way to make a matrix that's 1 whereever A==8
B2 = np.zeros((8,8),dtype=np.int)
B2[A==8]=1
B2=B2[1:-1,1:-1]
print B2
# the following would obviously work with either B1 or B2 (which are the same)
print np.multiply(B2,D[1:-1,1:-1])
Output:
[[1 0 0 0 0 0]
[1 1 0 0 0 0]
[0 1 0 0 0 0]
[0 0 0 1 0 0]
[0 0 0 1 0 0]
[0 0 0 0 1 0]]
[[1 0 0 0 0 0]
[1 1 0 0 0 0]
[0 1 0 0 0 0]
[0 0 0 1 0 0]
[0 0 0 1 0 0]
[0 0 0 0 1 0]]
[[2 0 0 0 0 0]
[1 2 0 0 0 0]
[0 3 0 0 0 0]
[0 0 0 2 0 0]
[0 0 0 2 0 0]
[0 0 0 0 3 0]]
A cleaner way, in my opinion, of writing the C loop is:
for k in range(1,7):
for l in range(1,7):
if B[k,l]==8:
C[k-1, l-1] += D[k,l]
That inner block of B (and D) can be selected with slices, B[1:7, 1:7] or B[1:-1, 1:-1].
A and D are defined as np.matrix. Since we aren't doing matrix multiplications here (no dot products), that can create problems. For example I was puzzled why
In [27]: (B[1:-1,1:-1]==8)*D[1:-1,1:-1]
Out[27]:
matrix([[2, 1, 2, 3, 3, 3],
[3, 3, 3, 4, 5, 5],
[1, 2, 1, 1, 2, 2],
[2, 2, 3, 2, 3, 1],
[2, 2, 3, 2, 3, 1],
[2, 3, 3, 2, 3, 2]])
What I expected (and matches the loop C) is:
In [28]: (B[1:-1,1:-1]==8)*D.A[1:-1,1:-1]
Out[28]:
array([[2, 0, 0, 0, 0, 0],
[1, 2, 0, 0, 0, 0],
[0, 3, 0, 0, 0, 0],
[0, 0, 0, 2, 0, 0],
[0, 0, 0, 2, 0, 0],
[0, 0, 0, 0, 3, 0]])
B = A.copy() still leaves B as matrix. B=A.A returns an np.ndarray. (as does np.copy(A))
D.A is the array equivalent of D. B[1:-1,1:-1]==8 is boolean, but when used in the multiplication context it is effectively 0s and 1s.
But if we want to stick with np.matrix then I'd suggest using the element by element multiply function:
In [46]: np.multiply((A[1:-1,1:-1]==8), D[1:-1,1:-1])
Out[46]:
matrix([[2, 0, 0, 0, 0, 0],
[1, 2, 0, 0, 0, 0],
[0, 3, 0, 0, 0, 0],
[0, 0, 0, 2, 0, 0],
[0, 0, 0, 2, 0, 0],
[0, 0, 0, 0, 3, 0]])
or just multiply the full matrixes, and select the inner block after:
In [47]: np.multiply((A==8), D)[1:-1, 1:-1]
Out[47]:
matrix([[2, 0, 0, 0, 0, 0],
[1, 2, 0, 0, 0, 0],
[0, 3, 0, 0, 0, 0],
[0, 0, 0, 2, 0, 0],
[0, 0, 0, 2, 0, 0],
[0, 0, 0, 0, 3, 0]])