numpy vectorized operation for a large array - python

I am trying to do some computations on a NumPy array in Python 3.
The array:
    c0 c1 c2 c3
r0   1  5  2  7
r1   3  9  4  6
r2   8  2  1  3
Here "cx" and "rx" are the column and row names.
For each row, I need to subtract the value in a designated column from every other element of that row, leaving the designated element itself unchanged.
For example, given the column list [0, 2, 1] (one column index per row):
for r0, we need to calculate the difference between the c0 and all other columns, so we have
[1, 5-1, 2-1, 7-1]
for r1, we need to calculate the difference between the c2 and all other columns, so we have
[3-4, 9-4, 4, 6-4]
for r2, we need to calculate the difference between the c1 and all other columns, so we have
[8-2, 2, 1-2, 3-2]
so, the result should be
1 4 1 6
-1 5 4 2
6 2 -1 1
Because the array could be very large, I would like to do the calculation with vectorized NumPy operations, e.g. broadcasting.
But I am not sure how to do it efficiently.
I have checked Vectorizing operation on numpy array, Vectorizing a Numpy slice operation, Vectorize large NumPy multiplication, Replace For Loop with Numpy Vectorized Operation, and Vectorize numpy array for loop,
but none of them work for me.
Thanks for any help!

Extract the values from the array first and then do subtraction:
import numpy as np
a = np.array([[1, 5, 2, 7],
              [3, 9, 4, 6],
              [8, 2, 1, 3]])
cols = [0,2,1]
# create the index for advanced indexing
idx = np.arange(len(a)), cols
# extract values
vals = a[idx]
# subtract array by the values
a -= vals[:, None]
# add original values back to corresponding position
a[idx] += vals
print(a)
# [[ 1  4  1  6]
#  [-1  5  4  2]
#  [ 6  2 -1  1]]
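The answer above modifies a in place. If the original array must be preserved, the same result can be built with a boolean mask and np.where. This is only a sketch of one alternative; the names rows, mask and result are illustrative and not part of the original answer.

```python
import numpy as np

a = np.array([[1, 5, 2, 7],
              [3, 9, 4, 6],
              [8, 2, 1, 3]])
cols = np.array([0, 2, 1])

rows = np.arange(a.shape[0])
# value of the designated column in each row
vals = a[rows, cols]
# True exactly at the designated (row, column) positions
mask = np.arange(a.shape[1]) == cols[:, None]
# keep the designated elements, subtract the row's value everywhere else
result = np.where(mask, a, a - vals[:, None])
print(result)
```

Unlike the in-place version, this allocates a new array of the same shape, which may matter for very large inputs.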

Related

Get Item Index and row Index of Matrices in a list to a matrix

I am pretty new to Python and am trying to tackle the following problem:
I have a list whose items are either empty arrays or arrays of dimension n x 2, for example:
a = np.array([])
b = np.array([1,2])
c = np.array([[5,6],[7,8]])
d = np.array([[11,12],[13,14],[15,16]])
D1 = [a,b,c,d]
I am then removing the empty members of the list like this:
D2 = [i for i in D1 if len(i) > 0]
to be able to stack the remaining items of the list in one 2D Array like this:
E = np.vstack(D2)
which gives me:
E = [[ 1  2]
     [ 5  6]
     [ 7  8]
     [11 12]
     [13 14]
     [15 16]]
Now, what I am trying to get is a matrix looking like this:
out = [[1 0]
       [2 0]
       [2 1]
       [3 0]
       [3 1]
       [3 2]]
Explanation:
The first column of out corresponds to the List Index of the corresponding row of E from D1 (List with empty entry!).
The second column of out corresponds to the Row Index of the corresponding row of E from D1.
Example:
The 3rd row, out[2,:] = [2 1], means that the 3rd row of E originates from the item at list index 2 in D1, at row index 1 of that matrix.
I am happy to make it more clear if needed. Any help is appreciated.
I think you don't need to define D2 and E; you can work with D1 alone. For every item in D1, store the index of the item together with the row indices of that item. Then filter out the entries that come from empty items, and finally stack what remains.
And you're done!
One last thing: in the NumPy array b you forgot to wrap the elements in another list, as you did for c and d, so b is 1-D instead of 1 x 2.
import numpy as np
a = np.array([])
b = np.array([[1,2]])
c = np.array([[5,6],[7,8]])
d = np.array([[11,12],[13,14],[15,16]])
D1 = [a,b,c,d]
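# build [list index, row index] pairs for every row of every item in D1
M = [[[i, j] for j in range(len(D1[i]))] for i in range(len(D1))]
# drop the empty lists produced by empty items
M = [x for x in M if len(x) > 0]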
M = np.vstack(M)
print(M)
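The same index matrix can also be built without nested list comprehensions by repeating each list index as many times as its item has rows. This is a sketch under the assumption that every non-empty item is 2-D (as b is in the corrected code above); the names lengths, list_idx and row_idx are illustrative.

```python
import numpy as np

a = np.array([])
b = np.array([[1, 2]])
c = np.array([[5, 6], [7, 8]])
d = np.array([[11, 12], [13, 14], [15, 16]])
D1 = [a, b, c, d]

# number of rows contributed by each item (0 for the empty one)
lengths = np.array([len(x) for x in D1])
# list index of each stacked row: item i is repeated lengths[i] times
list_idx = np.repeat(np.arange(len(D1)), lengths)
# row index within each item: 0..n-1 for an item with n rows
row_idx = np.concatenate([np.arange(n) for n in lengths])
out = np.column_stack([list_idx, row_idx])
print(out)
```

Empty items drop out automatically because they are repeated zero times, so no separate filtering step is needed.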

Checking min distance between multiple slices of dataframe of co-ordinates

I have two lists of cell IDs:
A = [4, 6, 2493, 2494, 2495]
B = [3, 7, 4983, 4982, 4984, 4981, 4985, 2492, 2496]
Each cell in the lists above has X and Y coordinates, stored in separate columns of a df:
df
cell_ID; X ;Y
1; 5; 6
2; 10; 6
...
The values in the A and B lists are the ones in the cell_ID column. How can I find the sum of distances between cells in A and B, primarily looking at cells in A in relation to B? So I have to calculate 5 (the length of A) distances for each cell in A, take the min() of those 5 and sum() all those nine min values. I hope that makes sense.
I was thinking the following:
1. Take the first value from list A (this is the cell with id = 4), calculate the distance to all cells in B, and keep only the min value
2. Repeat step 1 with all other values in A
3. Make a sum() of all distances
I tried with the code below... but failed
def sum_distances(df, i, col_X='X', col_Y='Y'):
for i in range(A)
return (((df.iloc[B][col_X] - df.iloc[i,2])**2 + (df.iloc[B][col_Y] - df.iloc[i,3])**2)**0.5).min
I don't know how to integrate min() and sum() at the same time.
If I'm not mistaken, you're looking for the Euclidean distance between (x, y) coordinates. Here is one possible approach (based on this SO post).
First, generate some dummy data in the same format as the OP's:
import pandas as pd
import numpy as np
A = [0, 1, 2, 3, 4]
B = [10, 11, 12, 13, 14, 9, 8, 7, 6]
df = pd.DataFrame(np.random.rand(15,2), columns=['X','Y'], index=range(15))
df.index.name = 'CellID'
print('Raw data\n{}'.format(df))
Raw data
X Y
CellID
0 0.125591 0.890772
1 0.754238 0.644081
2 0.952322 0.099627
3 0.090804 0.809511
4 0.514346 0.041740
5 0.678598 0.230225
6 0.594182 0.432634
7 0.005777 0.891440
8 0.925187 0.045035
9 0.903591 0.238609
10 0.187591 0.255377
11 0.252635 0.149840
12 0.513432 0.972749
13 0.433606 0.550940
14 0.104991 0.440052
To get the minimum distance between each index of B and A
# Get df at indexes from list A: df_A
df_A = df.iloc[A,]
# For df at each index from list B (df.iloc[b,]), get distance to df_A: d
dist = []
for b in B:
    d = (pd.DataFrame(df_A.values - df.iloc[b,].values)**2).sum(1)**0.5
    dist.append(d.min())
print('Sum of minimum distances is {}'.format(sum(dist)))
Output (for sum of minimum distances between each index of B and A)
Sum of minimum distances is 2.36509386378
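The loop above can also be written with NumPy broadcasting, which computes all pairwise distances at once. The following is only a sketch: it uses seeded random data (not the values printed above), assumes the DataFrame index holds the cell IDs, and the names pts_A, pts_B and pairwise are illustrative.

```python
import numpy as np
import pandas as pd

A = [0, 1, 2, 3, 4]
B = [10, 11, 12, 13, 14, 9, 8, 7, 6]

# seeded dummy data so the sketch is reproducible
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((15, 2)), columns=['X', 'Y'])
df.index.name = 'CellID'

pts_A = df.loc[A, ['X', 'Y']].to_numpy()   # shape (5, 2)
pts_B = df.loc[B, ['X', 'Y']].to_numpy()   # shape (9, 2)

# pairwise[i, j] is the Euclidean distance between B[i] and A[j]
diff = pts_B[:, None, :] - pts_A[None, :, :]    # shape (9, 5, 2)
pairwise = np.sqrt((diff ** 2).sum(axis=-1))    # shape (9, 5)

# for each cell in B, the distance to its nearest cell in A, then summed
total_B_to_A = pairwise.min(axis=1).sum()
# for each cell in A, the distance to its nearest cell in B, then summed
total_A_to_B = pairwise.min(axis=0).sum()
print(total_B_to_A, total_A_to_B)
```

The first total corresponds to the loop in the answer (one minimum per cell in B); the second follows the steps listed in the question (one minimum per cell in A). Which of the two is wanted depends on how the question is read.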

Is there a way to check for linearly dependent columns in a dataframe?

Is there a way to check for linear dependency for columns in a pandas dataframe? For example:
columns = ['A','B', 'C']
df = pd.DataFrame(columns=columns)
df.A = [0,2,3,4]
df.B = df.A*2
df.C = [8,3,5,4]
print(df)
A B C
0 0 0 8
1 2 4 3
2 3 6 5
3 4 8 4
Is there a way to show that column B is a linear combination of A, but C is an independent column? My ultimate goal is to run a poisson regression on a dataset, but I keep getting a LinAlgError: Singular matrix error, meaning no inverse exists of my dataframe and thus it contains dependent columns.
I would like to come up with a programmatic way to check each feature and ensure there are no dependent columns.
If you have SymPy you could use the "reduced row echelon form" via sympy.Matrix.rref:
>>> import sympy
>>> reduced_form, inds = sympy.Matrix(df.values).rref()
>>> reduced_form
Matrix([
[1.0, 2.0, 0],
[ 0, 0, 1.0],
[ 0, 0, 0],
[ 0, 0, 0]])
>>> inds
[0, 2]
The pivot columns (stored as inds) are the "column numbers" of the linearly independent columns, and you could simply "slice away" the other ones:
>>> df.iloc[:, inds]
A C
0 0 8
1 2 3
2 3 5
3 4 4
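If SymPy is not available, a similar check can be sketched with NumPy's np.linalg.matrix_rank. This is an alternative technique, not the method of the answer above: it is numerical, so it relies on matrix_rank's default tolerance and may misjudge nearly dependent columns. The greedy loop and the name independent are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 2, 3, 4]})
df['B'] = df.A * 2
df['C'] = [8, 3, 5, 4]

# fewer independent columns than columns means some column is dependent
rank = np.linalg.matrix_rank(df.to_numpy(dtype=float))
print(rank < df.shape[1])

# greedily keep a column only if it raises the rank of the kept set
independent = []
for col in df.columns:
    candidate = independent + [col]
    if np.linalg.matrix_rank(df[candidate].to_numpy(dtype=float)) == len(candidate):
        independent.append(col)
print(independent)
```

Because the loop walks the columns from left to right, it keeps the first column of each dependent group, which matches the pivot columns reported by rref for this example.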

Matlab to python conversion matrix operations

Hi, I am trying to convert this rectilinear distance formula from MATLAB to Python. X1 and X2 are two matrices of two-dimensional points and may have different lengths.
nd = size(X1); n = nd(1);
d = nd(2);
m = size(X2,1);
D = abs(X1(:,ones(1,m)) - X2(:,ones(1,n))') + ...
abs(X1(:,2*ones(1,m)) - X2(:,2*ones(1,n))');
I think the problem I am having most in python is appending the ones matrices with X1 and X2 since they are np.arrays.
First your code:
octave:1> X1=[0,1,2,3;2,3,1,1]'
octave:2> X2=[2,3,2;4,2,4]'
<your code>
octave:21> D
D =
4 3 4
2 3 2
3 2 3
4 1 4
Matching numpy code:
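import numpy as np
X1 = np.array([[0, 1, 2, 3], [2, 3, 1, 1]]).T
X2 = np.array([[2, 3, 2], [4, 2, 4]]).T
# broadcast to shape (n, m, 2), then sum the absolute differences over the last axis
D = np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=-1)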
produces, D:
array([[4, 3, 4],
[2, 3, 2],
[3, 2, 3],
[4, 1, 4]])
numpy broadcasts automatically, so it doesn't need the ones() to expand the dimensions. Instead I use None (same as np.newaxis) to create new dimensions. The difference is then 3D, which is then summed on the last axis.
I forgot how spoilt we are with the numpy broadcasting. Though newer Octave has something similar:
D = sum(abs(reshape(X1,[],1,2)-reshape(X2,1,[],2)),3)
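For comparison, the MATLAB indexing trick can also be translated almost literally. The sketch below shows that repeating a column with an index list gives the same matrix as the broadcasting version; the names D_literal and D_broadcast are illustrative.

```python
import numpy as np

X1 = np.array([[0, 1, 2, 3], [2, 3, 1, 1]]).T   # shape (n, 2)
X2 = np.array([[2, 3, 2], [4, 2, 4]]).T         # shape (m, 2)
n, m = X1.shape[0], X2.shape[0]

# MATLAB's X1(:, ones(1, m)) repeats column 1 m times;
# with 0-based indexing that becomes X1[:, [0] * m]
D_literal = (np.abs(X1[:, [0] * m] - X2[:, [0] * n].T)
             + np.abs(X1[:, [1] * m] - X2[:, [1] * n].T))

# broadcasting version: no explicit repetition needed
D_broadcast = np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=-1)
print(D_literal)
```

The literal form makes explicit copies of the repeated columns, while the broadcasting form also generalizes to points with more than two coordinates without further changes.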

Make Tuples from Specific Pandas Columns

I have a pandas dataframe, e.g.
   one  two  three  four  five
0    1    2      3     4     5
1    1    1      1     1     1
What I would like is to be able to convert only a select number of columns to a list, such that we obtain:
[[1,2],[1,1]]
These are rows 0 and 1, restricted to columns one and two.
Similarly if we selected columns one, two, four:
[[1,2,4],[1,1,1]]
Ideally I would like to avoid iteration of rows as it is slow!
You can select just those columns with:
In [11]: df[['one', 'two']]
Out[11]:
one two
0 1 2
1 1 1
and get the list of lists from the underlying numpy array using tolist:
In [12]: df[['one', 'two']].values.tolist()
Out[12]: [[1, 2], [1, 1]]
In [13]: df[['one', 'two', 'four']].values.tolist()
Out[13]: [[1, 2, 4], [1, 1, 1]]
Note: this should never really be necessary unless this is your end game... it's going to be much more efficient to do the work inside pandas or numpy.
So I worked out how to do it.
Firstly we select the columns we would like the values from:
y = x[['one','two']]
This gives us a subset df.
Now we can choose the values:
> y.values
array([[1, 2],
[1, 1]])
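Since the title asks for tuples rather than lists, here is a short sketch of one way to get them, using DataFrame.itertuples with index=False and name=None so that plain tuples are returned. The frame is rebuilt from the example in the question; the names pairs and triples are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5],
                   [1, 1, 1, 1, 1]],
                  columns=['one', 'two', 'three', 'four', 'five'])

# one plain tuple per row, restricted to the selected columns
pairs = list(df[['one', 'two']].itertuples(index=False, name=None))
triples = list(df[['one', 'two', 'four']].itertuples(index=False, name=None))
print(pairs)
print(triples)
```

As noted above, converting to Python lists or tuples is usually only worthwhile as a final step; further computation tends to be faster if it stays inside pandas or NumPy.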
