How to solve / what is the best way to approach a 3D-matrix indexing loop that prints nothing? - python

I have some confusion/difficulties related to this code: it didn't print out anything, yet there is no error message. The context is: I want to change some components of the main random matrix (psi) into the new matrix (psiy), and I want to check whether my code is correct with regard to the components' indexes (hence the print(init, sy)), but nothing comes out and there is no error message either. Does anyone have any idea? Thank you very much in advance.
The full problem is here:
I have a 3D matrix (nx x ny x nz), with nx being the number of components along the x-axis (say, horizontal), ny the number of components along the y-axis (say, vertical), and nz the number of components along the z-axis (say, out of the plane). So the total number of components is A = nx*ny*nz. We can also see it as a 2D (nx x ny) matrix with many layers (nz layers in total). The index of each component runs from the top left of the first layer to the bottom right of the nz-th layer.
So for a 4x4x4 matrix, we will have indexes 0 to 63, with 0 to 15 in the first layer, 16 to 31 in the second layer, and so on. And within the first layer, since nx and ny are 4, there are 4 indexes in each row and column (0 to 3 for the first row, 4 to 7 for the second row, ..., and 12 to 15 for the fourth row); the same order holds for the other layers.
I want to change my initial random matrix psi to a new matrix psiy with these conditions:
The init-th components of the new matrix psiy will be the init-th components of the old matrix psi times 3
The sy-th components of the new matrix psiy will be the sy-th components of the old matrix psi
Now, how do we know which components are init and sy? Going back to the description of my 3D matrix: if we group the components by column, then in the first layer (indexes 0 to 15) there are 4 columns since ny is 4, and what I mean by init is the top index of each column, in this case indexes 0, 1, 2, and 3. So for the second layer init would be indexes 16, 17, 18, and 19, and so on for the rest of the layers.
While for sy, the definition is: all the indexes in the even-numbered rows, excluding the last row. So for the case where nx, ny, and nz are all 4, the sy indexes would be:
4 to 7 for the first layer (since there are only 4 rows and the only even-numbered row that is not the last row is the 2nd row), and
20 to 23 for the second layer, 36 to 39 for the third layer, and 52 to 55 for the last layer.
So if ny is, for example, 8, sy would be all the indexes in the 2nd, 4th, and 6th rows of each layer, and so on for any (even) value of ny.
Thank you.
import numpy as np
nx=4
ny=4
nz=4
A=nx*ny*nz
HH=ny-2
H=int(HH/2)
psiy=np.zeros(A)
psi=np.random.randint(1,10,nx)
for i in range(0, nz):
    for m in range(1, H):
        for k in range(0, nx):
            init = i*nx*ny + k
            sy = init + (2*m-1)*nx
            psiy[init] = 3*psi[init]
            psiy[sy] = 1*psi[sy]
            print(init, sy)
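A likely reason nothing prints: with ny = 4, HH = ny - 2 = 2 and H = HH/2 = 1, so range(1, H) is range(1, 1), which is empty, and the inner loops never execute (hence no output and no error). The description above says m should cover every even-numbered non-last row, of which there are H, so the loop presumably needs range(1, H + 1). Also, psi is created with only nx elements, so psi[init] would raise an IndexError as soon as init reaches 4; it presumably needs A elements. A sketch of the corrected loop under those assumptions:
import numpy as np

nx = ny = nz = 4
A = nx*ny*nz
H = (ny - 2) // 2                  # number of even-numbered rows, excluding the last
psi = np.random.randint(1, 10, A)  # one value per component, not per column
psiy = np.zeros(A)
for i in range(nz):                # layer
    for k in range(nx):            # column within the layer
        init = i*nx*ny + k         # top index of the column
        psiy[init] = 3*psi[init]
        for m in range(1, H + 1):  # 2nd, 4th, ... row, skipping the last
            sy = init + (2*m - 1)*nx
            psiy[sy] = psi[sy]
            print(init, sy)
For nx = ny = nz = 4 this prints pairs such as (0, 4), (1, 5), ..., (16, 20), ..., matching the init and sy indexes described above.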

Related

Trouble understanding how PCA achieves image compression and dimensionality reduction

I was going through this amazing playlist on SVD by Steve Brunton on YouTube. I think I got the majority of the concepts, but there are some gaps. Let me add a couple of screenshots so that it's easier for me to explain.
He is considering the input matrix X to be a collection of images. So, considering an image is 28x28 pixels, we flatten it to create a 784x1 column vector. So, each column denotes an image, and the rows denote pixel indices. Let's take the dimension of X to be n x m. Now, after computing the economy SVD, if we keep only the first r (<< m) singular values, then the approximation of X is given by
X' = σ1·u1·v1^T + σ2·u2·v2^T + ... + σr·ur·vr^T
I understand that here, we're throwing away information, so the reconstructed images would be pixelated but they would still be of the same dimension (28x28). So, how are we achieving compression here? Is it because instead of storing 784m pixel values, we'll have to store r x (28 (length of each u) + 28 (length of each v)) pixels? Or is there something more to it?
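To make that guess concrete, here is the back-of-the-envelope count I have in mind (the sizes below are made up; note each u actually has length 784, the flattened image, and each v has length m, so it is not 28 + 28):
# Rough storage count for a rank-r truncated SVD of an n x m image matrix
n, m, r = 784, 10000, 50        # hypothetical sizes
full = n * m                    # store every pixel value: 7,840,000
truncated = r * (n + m + 1)     # r u-vectors, r v-vectors, r singular values: 539,250
print(full, truncated)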
My second question is, if I try to draw an analogy to numerical features, e.g. let's say a housing price dataset, that has 50 features, and 1000 data points. So, our X matrix has dimension 50 x 1000 (each column being a feature vector). In that case, if there are useless features, we'll get << 50 features (maybe 20, or 10... whatever) after applying PCA, right? I'm not able to grasp how that smaller feature vector is derived when we select only the biggest r singular values. Because X and X' have the same dimensions.
Let's have a sample code (imports and a random X are added here so the snippet runs; the dimensions are reversed because of how sklearn expects it).
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 50)  # 1000 samples, 50 features
pca = PCA(n_components=10)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape:   ", X.shape)      # original shape: (1000, 50)
print("transformed shape:", X_pca.shape)  # transformed shape: (1000, 10)
So, how are we going from 50 to 10 here? I get that in this case there would be 50 U basis vectors. So, even if we choose the top r from these 50, the dimensions will still be the same, right? Any help is appreciated.
I've been searching for the answer all over the web, and it finally clicked when I saw this video tutorial. We know X = U x ∑ x V.T. Here, the columns of U give us the principal components for the column space of X. Similarly, the rows of V.T give us the principal components for the row space of X. Since in PCA we tend to represent a feature vector by a row (unlike SVD), we'd select the first r principal components from the matrix V.T.
Let's assume the dimensions of X to be m x n. So, we have m samples, each having n features. That gives us the following dimensions for the SVD:
U: m x m
∑: m x n
V: n x n
Now, if we select only r (<< n) principal components, then the projection of X onto the r-dimensional space is given by X.[v1 v2 ... vr]. Here each of v1, v2, ..., vr is a column vector, so the dimension of [v1 v2 ... vr] is n x r. If we now multiply X with this matrix we get an m x r matrix, which is nothing but the projection of all the data points onto r dimensions.
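As a minimal numpy sketch of that projection (variable names are made up; note that sklearn's PCA also centres the data before the SVD, which is mimicked here):
import numpy as np

m, n, r = 1000, 50, 10
X = np.random.rand(m, n)
Xc = X - X.mean(axis=0)             # centre the data, as PCA does internally
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_r = Xc @ Vt[:r].T                 # (m, n) @ (n, r) -> (m, r) projection
print(X_r.shape)                    # (1000, 10)
Up to sign flips of individual components, X_r should match the X_pca produced by the sklearn snippet in the question.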

Adding 3rd dimension to 2D array in Python

I have a 2D array of dummy variables (0 and 1) with the shape (4432, 35) -> 4432 videos and 35 different customers. Since the videos consist of 1800 frames, I want to add a third dimension to this array with 1800 time steps (frames) so that it gets the shape (4432, 35, 1800). So I want Python to replicate the zeros and ones of the 2nd dimension 1800 times into the 3rd dimension.
How can I do that?
With an array called array of any 2D shape:
array = [[[j for k in range(1800)] for j in i] for i in array]
This will create a 3rd dimension with 1800 duplicates of the values in the second dimension.
It also seems to make more sense to have a shape (4432, 1800, 35): (video, frame, customers in frame):
array = [[i for k in range(1800)] for i in array]
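A numpy-based sketch of the same idea (the array name is made up; np.repeat along a new axis avoids building nested Python lists):
import numpy as np

a = np.random.randint(0, 2, (4432, 35))          # dummy 0/1 array
a3 = np.repeat(a[:, :, None], 1800, axis=2)      # (4432, 35, 1800)
a3_alt = np.repeat(a[:, None, :], 1800, axis=1)  # (4432, 1800, 35): (video, frame, customer)
print(a3.shape, a3_alt.shape)
If the copies never need to be modified independently, np.broadcast_to(a[:, :, None], (4432, 35, 1800)) yields the same values as a read-only view without allocating the full array.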

How to correctly broadcast when subtracting 2 different matrices in NumPy

I am trying to subtract two matrices of different shapes using broadcasting. However, I am stuck at one point and need a simple solution to the problem.
Essentially I am evaluating data on a grid (the first step is subtracting). For example, I have 5 grid points, grid = (-20, -10, 0, 10, 20), and an array of data of length 100.
Line:
u = grid.reshape((ngrid, 1)) - data
works perfectly fine. ngrid = 5 in this trivial example.
Output is matrix of 5 rows and 100 columns, so each point of data is evaluated on each point of grid.
Next I want to do it for 2 grids and 2 data sets simultaneously (data is of size 2x100, e.g. 2 randn arrays). I have already succeeded in subtracting two data sets from one grid, but using two grids throws an error.
In the example below, a is the vertical array of the grid (5 points long) and data is an array of random data of shape (100, 2).
In this case u is an array of shape (2, 5, 100), so u[0] and u[1] have 5 rows and 100 columns, meaning that the data was subtracted correctly from the grid.
Second line of the code is what I am trying to do. The error is following:
ValueError: operands could not be broadcast together with shapes (5,2) (2,1,100)
u = a - data.T[:, None] # a is vertical grid of 5 elements. Works ok.
u = grid_test - data.T[:, None] # grid_test is 2 column 5 row matrix of 2 grids. Error.
What I need is the same kind of line of code as above, but it should work if "a" contains 2 columns, i.e. two different grids. So in the end the expected result is "u", which contains, in addition to the results described above, another two matrices where the same data (both arrays) is evaluated on the second grid.
Unfortunately I cannot use any loops - only vectorization and broadcasting.
Thanks in advance.
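One way to line the axes up, as a sketch (array names follow the question; the grid values are invented). The idea is to move the "which grid" and "which data set" axes to the front and insert singleton axes so that the length-5 grid axis and length-100 data axis broadcast against each other:
import numpy as np

ngrid = 5
grid_test = np.stack([np.linspace(-20, 20, ngrid),
                      np.linspace(-40, 40, ngrid)], axis=1)  # (5, 2): two grids as columns
data = np.random.randn(100, 2)                               # (100, 2): two data sets as columns

# grid_test.T[:, None, :, None] has shape (2, 1, 5, 1)
# data.T[None, :, None, :]      has shape (1, 2, 1, 100)
u = grid_test.T[:, None, :, None] - data.T[None, :, None, :]
print(u.shape)  # (2, 2, 5, 100): u[g, d] is data set d evaluated on grid g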

Cosine similarity between two ndarrays

I have two numpy arrays; the first array is of size 100*4*200, and the second array is of size 150*6*200. In fact, I am storing the 100 samples of 200-dimensional vector representations of 4 fields in array 1 and the 150 samples of 200-dimensional vectors of 6 fields in array 2.
Now I want to compute the similarity vector between the samples and create a similarity matrix. For each sample, I would like to calculate the similarity between each combination of fields and store it such that I get a 15000*24 dimensional array.
The first 150 rows will be the similarity vectors between the 1st row of array 1 and the 150 rows of array 2, the next 150 rows will be the similarity vectors between the 2nd row of array 1 and the 150 rows of array 2, etc.
Each similarity vector has length (# fields in array 1) * (# fields in array 2), i.e. the 1st element of the similarity vector is the cosine similarity between field 1 of array 1 and field 1 of array 2, the 2nd element is the similarity between field 1 of array 1 and field 2 of array 2, and so on, with the last element being the similarity between the last field of array 1 and the last field of array 2.
What is the best way to do this using numpy arrays?
So every "row" (i assume the first axis, that I'll call axis 0) is the sample axis. That means you have 100 samples from one vector, each with fieldsxdimentions 4x200.
Doing this the way you describe, then the first row of the first array would have (4,200) and the second one would then have (150,6,200). Then you'd want to do a cos distance between an (m,n), and (m,n,k) array, which does not make sense (the closest you have to a dot product here would be the tensor product, which I'm fairly sure is not what you want).
So we have to extract these first and then iterate over all the others.
To do this I actually recommend just splitting the arrays with np.split and iterating over both of them. This is just because I've never come across a faster way in numpy. You could use tensorflow to gain efficiency, but I'm not going into that here in my answer.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
a = np.random.rand(100, 4, 200)
b = np.random.rand(150, 6, 200)
# We know the output will be 150*100 x 6*4
c = np.empty([15000, 24])
# Make an array with the rows of a and same for b
a_splitted=np.split(a, a.shape[0], 0)
b_splitted=np.split(b, b.shape[0], 0)
i=0
for alpha in a_splitted:
    for beta in b_splitted:
        # Gives a 4x6 matrix
        sim = cosine_similarity(alpha[0], beta[0])
        c[i, :] = sim.ravel()
        i += 1
For the similarity function above I just chose what @StefanFalk suggested: sklearn.metrics.pairwise.cosine_similarity. If this similarity measure is not sufficient, you could write your own.
I am not at all claiming that this is the best way to do this in all of python. I think the most efficient way is to do this symbolically using, as mentioned, tensorflow.
Anyways, hope it helps!
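For what it's worth, the loops can also be removed entirely within numpy itself; a sketch, with cosine similarity written out by hand as normalised dot products:
import numpy as np

a = np.random.rand(100, 4, 200)
b = np.random.rand(150, 6, 200)

# Normalise the 200-dimensional vectors, then contract over that axis.
a_n = a / np.linalg.norm(a, axis=2, keepdims=True)
b_n = b / np.linalg.norm(b, axis=2, keepdims=True)
sim = np.einsum('ifk,jgk->ijfg', a_n, b_n)  # sim[i, j, f, g] = cos(a[i, f], b[j, g])
c = sim.reshape(100 * 150, 4 * 6)           # (15000, 24), rows ordered as in the question
print(c.shape)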

Python: Get median in 3-dimensional numpy array

I have a 3-dimensional numpy array, where the first two dimensions form a grid, and the third dimension (let's call it cell) is a vector of attributes. Here is an example for array x (a 2x3 grid with 4 attributes in each cell):
[[[1 2 3 4][5 6 7 8][9 8 7 6]]
[[9 8 7 6][5 4 3 2][1 2 3 4]]]
for which I want to get the median of the 8 neighbors of each cell in array x, i.e. for x[i,j,:] it would be the median over all cells whose indices combine i-1, i, i+1 with j-1, j, j+1, excluding (i,j) itself. It is clear how to do that, but at the borders the index would go out of range (e.g. if i=0, a general solution that takes x[i-1,j,:] into the calculation wouldn't work).
Now the simple solution (simple in the sense of not thought through) would be to treat the 4 corners (e.g. where i=j=0), the borders (e.g. where i=0 and j!=0), and the default case of cells in the middle separately with if statements, but I would hope that there is a more elegant solution to this problem. I thought about extending the n*m grid to an (n+2)*(m+2) grid and filling the border cells on all sides with 0 values, but that would distort the median computation.
I hope I was able to kind of clarify the problem. Thanks in advance for any suggestions for a more elegant way to solve this.
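One sketch of the padding idea that does not distort the median: pad with NaN instead of 0 and take a NaN-ignoring median (this assumes the median is wanted per attribute and that the data is, or can be cast to, float):
import numpy as np

x = np.random.rand(2, 3, 4)  # (grid rows, grid cols, attributes)
n_rows, n_cols, _ = x.shape

# NaN-pad the two grid axes so border cells simply have fewer neighbours.
xp = np.pad(x, ((1, 1), (1, 1), (0, 0)), constant_values=np.nan)

med = np.empty_like(x)
for i in range(n_rows):
    for j in range(n_cols):
        block = xp[i:i+3, j:j+3, :].copy()
        block[1, 1, :] = np.nan  # exclude the cell itself from its own median
        med[i, j, :] = np.nanmedian(block, axis=(0, 1))
Corner cells still have three real neighbours, edge cells five, and interior cells eight; np.nanmedian simply ignores the NaN padding, so no separate if cases are needed.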
