I have a rather large matrix (500000 × 24) as an ndarray, and I want to multiply each of its cells by the corresponding column minimum. I have already done this with for loops, but I keep reading that this is not the NumPy way of doing things.
Is there a proper way of doing such an operation (I might also want to subtract a constant later)?
Thanks in advance
Yes, you can multiply your array by the minimum vector directly; an example is shown below.
import numpy as np
data = np.random.random((500000, 24))
# axis=0 takes the minimum down each column, returning an array of 24 values
minimum = data.min(axis=0)
data = data * minimum
If you wish to create a minimum array of size 500,000 (where the minimum of each row of 24 values is taken) then you would choose axis=1; that result needs an extra axis before it broadcasts against data (see the sketch below).
This set of slides discusses how such broadcasting operations work.
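Since you mention subtracting a constant later: scalars broadcast the same way, and the row-minimum variant just needs the extra axis. A minimal sketch:
import numpy as np

data = np.random.random((500000, 24))

# Subtracting a scalar broadcasts over every cell
shifted = data - 0.5

# Row minima (axis=1) have shape (500000,); add an axis so they
# broadcast across the 24 columns
row_scaled = data * data.min(axis=1)[:, None]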
Would normal multiply not do?
import numpy
a = numpy.random.random((4,2))
# numpy.min(a, axis=0) has shape (2,), so it broadcasts across the 4 rows
b = a * numpy.min(a,axis=0)
I have a data frame with shape:
(20,30,1024)
I want to find the Euclidean distance between every entry and every other entry in the dataframe, ideally non-redundantly (i.e., don't compute the distance between row 1 and row 5 and then again between row 5 and row 1), but I'm not there yet. I have this code:
from scipy.spatial.distance import pdist,squareform
distances = pdist(df_test,metric='euclidean')
dist_matrix = squareform(distances)
print(dist_matrix)
The error says:
A 2-dimensional array must be passed.
So I guess I want to convert my matrix from shape (20,30,1024) to (20,30720), and then calculate the pdist/squareform between the rows (i.e. 20 rows of vectors that are 30720 in length).
I know that I can use test_df[0:20].flatten().tolist()
But that completely flattens my matrix; the output shape was (1,614400).
Can someone show me how to convert a shape from (20,30,1024) to (20,30720), or tell me if I'm not going about this the right way?
The ultimate end goal is to calculate Euclidean distance between all non-redundant pairs in a data set, but the data set is big, so I need to do it as efficiently as possible/not duplicating calculations.
The most straightforward way to reshape that I can think of, according to how you described the problem, is:
df_test.values.reshape(20, -1)
By calling .values, you retrieve your dataframe data as a numpy array. From there, .reshape finishes the job. Since you need a 2D array, you provide the size of the first dimension (in your case, 20) and pass -1 so Numpy calculates the size of the second dimension for you (here, by multiplying the remaining dimension sizes of the original 3D array).
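Putting it together, a minimal sketch (using a random array as a stand-in for df_test's data). Note that pdist already computes only the non-redundant pairs, so no distance is calculated twice:
import numpy as np
from scipy.spatial.distance import pdist, squareform

arr = np.random.random((20, 30, 1024))  # stand-in for df_test.values

flat = arr.reshape(20, -1)              # shape (20, 30720)

# pdist returns the condensed form: 20*19/2 = 190 unique pairwise distances
distances = pdist(flat, metric='euclidean')
dist_matrix = squareform(distances)     # symmetric (20, 20) matrix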
I have a matrix A with 500 rows and 1024 columns. I would like to generate a matrix consisting of evenly spaced columns from A, say with step size 2^5. How do I do this in Numpy? I haven't seen this explained in the references I have.
You can just use slicing:
import numpy as np
arr = np.random.rand(500, 1024)
step_size = 2 ** 5
arr[:, ::step_size] # shape is (500, 32)
This keeps all the rows while taking every step_size-th column. You can read about numpy indexing at the following link:
https://numpy.org/doc/stable/user/basics.indexing.html?highlight=indexing#other-indexing-options
You can apply the same logic to the rows, or to both rows and columns, to get more sophisticated slicing.
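For example, a quick sketch (resulting shapes shown in the comments):
import numpy as np

arr = np.random.rand(500, 1024)

cols = arr[:, ::32]      # every 32nd column -> shape (500, 32)
rows = arr[::10, :]      # every 10th row    -> shape (50, 1024)
both = arr[::10, ::32]   # both at once      -> shape (50, 32)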
I have an array of values and would like to create a matrix from that, where each row is my starting point vector multiplied by a sample from a (normal) distribution.
The number of rows of this matrix will then depend on the number of samples I want.
%pylab
my_vec = array([1,2,3])
my_rand_vec = my_vec*randn(100)
The last command does not work because the array shapes do not match.
I could think of using a for loop, but I am trying to leverage array operations.
Try this
my_rand_vec = my_vec[None,:]*randn(100)[:,None]
For a small example I get:
import numpy as np
my_vec = np.array([1,2,3])
my_rand_vec = my_vec[None,:]*np.random.randn(5)[:,None]
my_rand_vec
# array([[ 0.45422416, 0.90844831, 1.36267247],
# [-0.80639766, -1.61279531, -2.41919297],
# [ 0.34203295, 0.6840659 , 1.02609885],
# [-0.55246431, -1.10492863, -1.65739294],
# [-0.83023829, -1.66047658, -2.49071486]])
Your solution my_vec*randn(100) does not work because * performs element-wise multiplication, which only works if both arrays have identical (or broadcastable) shapes.
What you have to do is add an additional dimension using [None,:] and [:,None] so that numpy's broadcasting works.
As a side note, I would recommend not using pylab; instead, include modules explicitly with import numpy as np and so on.
This is the outer product of two vectors:
my_rand_vec = np.outer(np.random.randn(100), my_vec)
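As a quick check, the outer product agrees with the broadcasting version from the answer above:
import numpy as np

my_vec = np.array([1, 2, 3])
r = np.random.randn(100)

# Both produce a (100, 3) array whose n-th row is r[n] * my_vec
assert np.allclose(np.outer(r, my_vec), my_vec * r[:, None])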
You can pass the dimensions of the array you require to numpy.random.randn (note that this scales every element by an independent random draw):
my_rand_vec = my_vec*np.random.randn(100,3)
To multiply each vector by the same random number, you need to add an extra axis:
my_rand_vec = my_vec*np.random.randn(100)[:,np.newaxis]
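A small sketch contrasting the two behaviours (the values themselves are random, of course):
import numpy as np

my_vec = np.array([1, 2, 3])

per_element = my_vec * np.random.randn(100, 3)          # independent draw per cell
per_row = my_vec * np.random.randn(100)[:, np.newaxis]  # one draw per row

# Each row of per_row is a scalar multiple of my_vec; rows of per_element are not
print(per_row[0] / my_vec)  # three identical values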
I am trying to apply lfilter to a collection of 1D arrays, i.e. to a 2D array whose rows correspond to different signals. This is the code:
import numpy as np
from scipy import signal
from scipy import stats
sysdim=2 # dimension of the filter, i.e. how far into the past it depends
ksim=100 # number of different signals to be filtered
x_size=10000
# A, B, C and D hold per-signal filter coefficients
A=np.random.randn(sysdim*ksim).reshape((ksim,sysdim))
B=np.random.randn(sysdim*ksim).reshape((ksim,sysdim))
C=2.0*np.random.randn(sysdim*ksim).reshape((ksim,sysdim))
D=2.0*np.random.randn(sysdim*ksim).reshape((ksim,sysdim))
print(A.shape, np.random.randn(x_size*ksim).reshape((ksim,x_size)).shape)
x=signal.lfilter(A,np.hstack((np.ones((ksim,1)),C)),np.random.randn(x_size*ksim).reshape((ksim,x_size)),axis=1)
y=signal.lfilter(B,np.hstack((np.ones((ksim,1)),D)),x,axis=1)
And I am getting the following error:
ValueError: object too deep for desired array
Can somebody guide me please?
So, you get the error on the line x=...
The shapes of your parameters are: numerator (100, 2), denominator (100, 3) and data (100, 10000). The problem you are having is that lfilter expects to use the same filter for all the items it processes, i.e. it only accepts 1-D vectors for the numerator and denominator.
It seems that you really need to turn that into a loop along the rows. Something like this:
# numer_array: R different numerators in an array with R rows
# denom_array: R different denominators in an array with R rows
# data: R data vectors in an array with R rows
# out_sig: output signal
# note the argument order: lfilter takes the numerator first, then the denominator
out_sig = np.array([signal.lfilter(numer_array[n], denom_array[n], data[n]) for n in range(data.shape[0])])
See http://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.lfilter.html for more information on what lfilter expects.
(But don't worry, the performance hit is minimal, most time is spent filtering anyway.)
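Adapted to the shapes in the question, a sketch of that loop might look like this (the coefficient arrays are random stand-ins):
import numpy as np
from scipy import signal

ksim, sysdim, x_size = 100, 2, 10000

A = np.random.randn(ksim, sysdim)                    # per-signal numerators
C = np.hstack((np.ones((ksim, 1)),
               2.0*np.random.randn(ksim, sysdim)))   # per-signal denominators
noise = np.random.randn(ksim, x_size)

# filter each row with its own coefficients: lfilter(b, a, x)
x = np.array([signal.lfilter(A[n], C[n], noise[n]) for n in range(ksim)])
print(x.shape)  # (100, 10000)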
I want to calculate the standard deviation for values below and above the average of a matrix of n_par parameters and n_sample samples. The fastest way I found so far is:
stdleft = numpy.zeros_like(mean)
for jpar in xrange(mean.shape[1]):
    stdleft[jpar] = p[p[:,jpar] < mean[jpar], jpar].std()
where p is a matrix of shape (n_samples, n_par). Is there a smarter way to do it without the for loop? I have roughly n_par = 200 and n_samples = 1e8, so these three lines take ages to run.
Any idea would be really helpful!
Thank you
As I understand it, you want to calculate the standard deviation of each column where the values are below the mean for that column.
In numpy, it's easiest to use masked arrays for this.
As an example:
import numpy as np
# 10 samples, 3 columns
p = np.random.random((10, 3))
# Calculate the mean of each column
colmeans = p.mean(axis=0)
# Make a boolean array where our condition is True
mask = p < colmeans
# Find the standard deviation of values in each column below the column's mean.
# For masked arrays, the True values will be masked, so we'll invert the array.
stdleft = np.ma.masked_where(~mask, p).std(axis=0)
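The above-the-mean counterpart is the mirror image; mask the below-mean values instead:
stdright = np.ma.masked_where(mask, p).std(axis=0)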
You can also use pandas for this, as @SudeepJuvekar mentioned. The performance should be broadly similar, but pandas may be a bit faster for this particular operation (untested).
Pandas is your friend. Convert your matrix into a pandas DataFrame and index the DataFrame logically. Something like this:
import pandas
mat = pandas.DataFrame(p)
This creates a DataFrame from the original numpy matrix p. Then we compute the column means of the DataFrame.
m = mat.mean()
This creates an n_par-sized Series of the column means of mat. Finally, index mat using the < logical operation and apply std to the result.
stdleft = mat[mat < m].std()
Similarly for stdright. It takes a couple of minutes to compute on my machine.
Here's the doc page for pandas: http://pandas.pydata.org/
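A self-contained sketch of that approach (with a small random stand-in for p):
import numpy as np
import pandas as pd

p = np.random.random((1000, 5))  # stand-in for the (n_samples, n_par) matrix
mat = pd.DataFrame(p)
m = mat.mean()

# Values failing the condition become NaN, and std() skips NaN by default
stdleft = mat[mat < m].std()
stdright = mat[mat > m].std()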
Edit: edited following the comment below. You can do almost the same indexing using the original p.
m = p.mean(axis=0)
logical = p < m
logical contains a boolean matrix of the same size as p. This is where pandas comes in handy: you can directly index a pandas DataFrame using a logical frame of the same size, while doing so in numpy is slightly harder. There, looping over the columns is probably the simplest way:
for i in range(p.shape[1]):
    stdleft[i] = p[logical[:, i], i].std()
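Put together as a runnable sketch (again with a small stand-in for p, and with stdleft allocated up front):
import numpy as np

p = np.random.random((1000, 5))  # stand-in for the (n_samples, n_par) matrix
m = p.mean(axis=0)
logical = p < m

stdleft = np.empty(p.shape[1])
for i in range(p.shape[1]):
    stdleft[i] = p[logical[:, i], i].std()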