Right now I am doing this by iterating, but there has to be a way to accomplish this task using numpy functions. My goal is to take a 2D array and average J columns at a time, producing a new array with the same number of rows as the original, but with the number of columns divided by J.
So I want to take this:
J = 2  # two columns averaged at a time
[[1 2 3 4]
[4 3 7 1]
[6 2 3 4]
[3 4 4 1]]
and produce this:
[[1.5 3.5]
[3.5 4.0]
[4.0 3.5]
[3.5 2.5]]
Is there a simple way to accomplish this task? I also need to handle the case where the number of columns is not divisible by J, so that I never end up with an unaveraged remainder column. If, for example, I have an input array with 5 columns and J=2, I would average the first two columns, then the last three columns together.
Any help you can provide would be great.
data.reshape(-1,j).mean(axis=1).reshape(data.shape[0],-1)
If your j divides data.shape[1], that is.
Example:
In [40]: data
Out[40]:
array([[7, 9, 7, 2],
[7, 6, 1, 5],
[8, 1, 0, 7],
[8, 3, 3, 2]])
In [41]: data.reshape(-1,j).mean(axis=1).reshape(data.shape[0],-1)
Out[41]:
array([[ 8. , 4.5],
[ 6.5, 3. ],
[ 4.5, 3.5],
[ 5.5, 2.5]])
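If you also need the uneven case from the question (fold the leftover columns into the last group), here is a minimal sketch built on np.add.reduceat; the function name and the grouping rule are my own reading of the question, not part of the answer above:
import numpy as np

def mean_column_groups(data, j):
    cols = data.shape[1]
    # group starts: 0, j, 2j, ...; stopping short of a final partial group
    # folds any leftover columns into the last full group
    starts = np.arange(0, cols - cols % j, j)
    sums = np.add.reduceat(data, starts, axis=1)
    counts = np.diff(np.append(starts, cols))  # true width of each group
    return sums / counts

data = np.array([[1, 2, 3, 4, 5],
                 [4, 3, 7, 1, 2]])
mean_column_groups(data, 2)  # groups [0,1] and [2,3,4] -> [[1.5, 4.0], [3.5, 3.33...]]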
First of all, it looks to me like you're not averaging the columns at all, you're just averaging two data points at a time. It seems to me like you're best off reshaping the array so that you have an N x 2 data structure that you can feed directly to mean. You may have to pad it first if the number of columns isn't quite compatible. Then at the end, just do a weighted average of the padded remainder column and the one before it. Finally, reshape back to the shape you want.
To play off of the example provided by TheodrosZelleke:
In [1]: data = np.concatenate((data, np.array([[5, 6, 7, 8]]).T), 1)
In [2]: data
Out[2]:
array([[7, 9, 7, 2, 5],
[7, 6, 1, 5, 6],
[8, 1, 0, 7, 7],
[8, 3, 3, 2, 8]])
In [3]: cols = data.shape[1]
In [4]: j = 2
In [5]: dataPadded = np.concatenate((data, np.zeros((data.shape[0], (j - cols % j) % j))), 1)  # the outer % j avoids padding when cols divides evenly
In [6]: dataPadded
Out[6]:
array([[ 7., 9., 7., 2., 5., 0.],
[ 7., 6., 1., 5., 6., 0.],
[ 8., 1., 0., 7., 7., 0.],
[ 8., 3., 3., 2., 8., 0.]])
In [7]: dataAvg = dataPadded.reshape((-1,j)).mean(axis=1).reshape((data.shape[0], -1))
In [8]: dataAvg
Out[8]:
array([[ 8. , 4.5, 2.5],
[ 6.5, 3. , 3. ],
[ 4.5, 3.5, 3.5],
[ 5.5, 2.5, 4. ]])
In [9]: if cols % j:
   ...:     # dataAvg[:, -1] is a mean over j entries (padding zeros included), so
   ...:     # multiply both columns by j to recover sums before recombining
   ...:     dataAvg[:, -2] = (dataAvg[:, -2] * j + dataAvg[:, -1] * j) / (j + cols % j)
   ...:     dataAvg = dataAvg[:, :-1]
   ...:
In [10]: dataAvg
Out[10]:
array([[ 8.        ,  4.66666667],
       [ 6.5       ,  4.        ],
       [ 4.5       ,  4.66666667],
       [ 5.5       ,  4.33333333]])
How can I stack the elements at the same respective index from each array in a list of arrays?
arrays = [np.array([1,2,3,4,5]),
np.array([6,7,8,9]),
np.array([11,22,33,44,55]),
np.array([2,4])]
output = [[1,6,11,2],
[2,7,22,4],
[3,8,33],
[4,9,44],
[5,55]]
arrays is a list of arrays of uneven lengths. The output has a first array (don't mind if it's a list too) that contains all possible index 0s from each array. The next array within output contains all possible index 1s and so on...
Closest thing I can find (but requires same shape arrays) is:
a = np.array([1, 2, 3])
b = np.array([2, 3, 4])
np.stack((a, b), axis=-1)
# which gives
array([[1, 2],
[2, 3],
[3, 4]])
Thanks.
This gets you close. You can't really have a ragged 2D array in numpy like the one shown in your example output.
import numpy as np
arrays = [np.array([1,2,3,4,5]),
np.array([6,7,8,9]),
np.array([11,22,33,44,55]),
np.array([2,4])]
maxx = max(x.shape[0] for x in arrays)  # length of the longest array
for x in arrays:
    x.resize(maxx, refcheck=False)      # pad in place with zeros to equal length
output = np.stack(arrays, axis=1)
print(output)
C:\tmp>python x.py
[[ 1 6 11 2]
[ 2 7 22 4]
[ 3 8 33 0]
[ 4 9 44 0]
[ 5 0 55 0]]
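Note that ndarray.resize mutates the input arrays in place. If you'd rather leave arrays untouched, a minimal sketch using np.pad instead (my variant, not part of the answer above):
import numpy as np

arrays = [np.array([1, 2, 3, 4, 5]),
          np.array([6, 7, 8, 9]),
          np.array([11, 22, 33, 44, 55]),
          np.array([2, 4])]

maxx = max(x.shape[0] for x in arrays)
# np.pad returns a zero-padded copy, leaving the originals intact
padded = [np.pad(x, (0, maxx - x.shape[0])) for x in arrays]
output = np.stack(padded, axis=1)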
You could just wrap it in a DataFrame first:
import pandas as pd

arr = pd.DataFrame(arrays).values.T
Output:
array([[ 1., 6., 11., 2.],
[ 2., 7., 22., 4.],
[ 3., 8., 33., nan],
[ 4., 9., 44., nan],
[ 5., nan, 55., nan]])
Though if you really want it with different sizes, go with:
arr = [x.dropna().values for _, x in pd.DataFrame(arrays).items()]
Output:
[array([ 1, 6, 11, 2]),
array([ 2, 7, 22, 4]),
array([ 3., 8., 33.]),
array([ 4., 9., 44.]),
array([ 5., 55.])]
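If plain Python lists are acceptable for the ragged output, a minimal dependency-free sketch with itertools.zip_longest (my own suggestion, not from the answers above):
from itertools import zip_longest

import numpy as np

arrays = [np.array([1, 2, 3, 4, 5]),
          np.array([6, 7, 8, 9]),
          np.array([11, 22, 33, 44, 55]),
          np.array([2, 4])]

# zip_longest pads the shorter arrays with None; filtering the Nones back
# out yields exactly the ragged structure from the question
output = [[v for v in group if v is not None]
          for group in zip_longest(*arrays)]
# [[1, 6, 11, 2], [2, 7, 22, 4], [3, 8, 33], [4, 9, 44], [5, 55]]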
I am receiving the right answer when I compute the Vandermonde coefficients of this matrix, but the output is reversed: it should be [6, -39, 55, 27] instead of [27, 55, -39, 6]. Both my Vandermonde matrix and the final solution c come out flipped.
import numpy as np
from numpy import linalg as LA

x = np.array([[4], [2], [0], [-1]])
f = np.array([[7], [29], [27], [-73]])

def main():
    A_matrix = VandermondeMatrix(x)
    print(A_matrix)
    c = LA.solve(A_matrix, f)  # coefficients of the Vandermonde polynomial
    print(c)

def VandermondeMatrix(x):
    n = len(x)
    A = np.zeros((n, n))
    exponent = np.array(range(0, n))
    for j in range(n):
        A[j, :] = x[j] ** exponent
    return A

if __name__ == "__main__":
    main()
Just build the exponent range in descending order from the beginning; then you don't have to flip afterwards, which avoids the extra step:
def VandermondeMatrix(x):
    n = len(x)
    A = np.zeros((n, n))
    exponent = np.array(range(n - 1, -1, -1))
    for j in range(n):
        A[j, :] = x[j] ** exponent
    return A
Out:
#A_matrix:
[[64. 16. 4. 1.]
[ 8. 4. 2. 1.]
[ 0. 0. 0. 1.]
[-1. 1. -1. 1.]]
#c:
[[ 6.]
[-39.]
[ 55.]
[ 27.]]
np.flip(c)? See the numpy.flip documentation for details.
You could do
print(c[::-1])
which will reverse the order of c.
From How can I flip the order of a 1d numpy array?
np.vander has a parameter that controls exactly this: increasing. The default, increasing=False, produces the descending-power order you want; increasing=True reproduces the order your function currently builds.
Example from the documentation:
x = np.array([1, 2, 3, 5])
np.vander(x)
array([[ 1, 1, 1, 1],
[ 8, 4, 2, 1],
[ 27, 9, 3, 1],
[125, 25, 5, 1]])
np.vander(x, increasing=True)
array([[ 1, 1, 1, 1],
[ 1, 2, 4, 8],
[ 1, 3, 9, 27],
[ 1, 5, 25, 125]])
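Applied to the data in the question, a minimal sketch (I've flattened x and f to 1-D, since np.vander expects a 1-D input and solve then returns a 1-D c):
import numpy as np

x = np.array([4, 2, 0, -1], dtype=float)
f = np.array([7, 29, 27, -73], dtype=float)

# default np.vander uses decreasing powers, so no flip is needed
A = np.vander(x)
c = np.linalg.solve(A, f)
print(c)  # [  6. -39.  55.  27.]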
In [3]: def VandermondeMatrix(x):
...: n = len(x)
...: A = np.zeros((n, n))
...: exponent = np.array(range(0,n))
...: for j in range(n):
...: A[j, :] = x[j]**exponent
...: return A
...:
In [4]: x = np.array([[4],[2],[0],[-1]])
In [5]: VandermondeMatrix(x)
Out[5]:
array([[ 1., 4., 16., 64.],
[ 1., 2., 4., 8.],
[ 1., 0., 0., 0.],
[ 1., -1., 1., -1.]])
In [6]: f = np.array([[7],[29],[27],[-73]])
In [7]: np.linalg.solve(_5,f)
Out[7]:
array([[ 27.],
[ 55.],
[-39.],
[ 6.]])
The result is a (4,1) array; reverse rows with:
In [9]: _7[::-1]
Out[9]:
array([[ 6.],
[-39.],
[ 55.],
[ 27.]])
Negative-stride indexing with [::-1] is also how you reverse plain Python lists and strings.
In [10]: ['a','b','c'][::-1]
Out[10]: ['c', 'b', 'a']
I have a matrix M that is rather large. I am trying to find the top 5 closest distances along with their indices.
M = csr_matrix(M)
dst = pairwise_distances(M,Y=None,metric='euclidean')
dst becomes a huge matrix and I am trying to sort it efficiently or use scipy or sklearn to find the closest 5 distances.
Here is an example of what I am trying to do:
X = np.array([[2, 3, 5], [2, 3, 6], [2, 3, 8], [2, 3, 3], [2, 3, 4]])
I then calculate dst as:
[[ 0. 1. 3. 2. 1.]
[ 1. 0. 2. 3. 2.]
[ 3. 2. 0. 5. 4.]
[ 2. 3. 5. 0. 1.]
[ 1. 2. 4. 1. 0.]]
So, row 0 to itself has a distance of 0., row 0 to 1 has a distance of 1.,... row 2 to row 3 has a distance of 5., and so on. I want to find these closest 5 distances and put them in a list with the corresponding rows, maybe like [distance, row, row]. I don't want any diagonal elements or duplicate elements so I take the upper triangular matrix as follows:
[[ inf 1. 3. 2. 1.]
[ nan inf 2. 3. 2.]
[ nan nan inf 5. 4.]
[ nan nan nan inf 1.]
[ nan nan nan nan inf]]
Now, the smallest distances, from least to greatest (note the ties), are:
[1, 0, 1], [1, 0, 4], [1, 3, 4], [2, 1, 2], [2, 0, 3], [2, 1, 4]
As you can see there are three elements that have distance 2 and three elements that have distance 1. From these I want to randomly choose one of the elements with distance 2 to keep as I only want the top f elements where f=5 in this case.
This is just a sample as this matrix could be very large. Is there an efficient way to do the above besides using a basic sorted function? I couldn't find any sklearn or scipy to help me with this.
Here's a fully vectorized solution to your problem:
import numpy as np
from scipy.spatial.distance import pdist

def smallest(M, f):
    # compute the condensed distance matrix
    dst = pdist(M, 'euclidean')
    # indices of the upper triangular matrix
    rows, cols = np.triu_indices(M.shape[0], k=1)
    # indices of the f smallest distances
    idx = np.argsort(dst)[:f]
    # gather results in the specified format: distance, row, column
    return np.vstack((dst[idx], rows[idx], cols[idx])).T
Notice that np.argsort(dst)[:f] yields the indices of the smallest f elements of the condensed distance matrix dst sorted in ascending order.
The following demo reproduces the result of your toy example and shows how the function smallest deals with a fairly large matrix of integers:
In [59]: X = np.array([[2, 3, 5], [2, 3, 6], [2, 3, 8], [2, 3, 3], [2, 3, 4]])
In [60]: smallest(X, 5)
Out[60]:
array([[ 1., 0., 1.],
[ 1., 0., 4.],
[ 1., 3., 4.],
[ 2., 0., 3.],
[ 2., 1., 2.]])
In [61]: large_X = np.random.randint(100, size=(10000, 2000))
In [62]: large_X
Out[62]:
array([[ 8, 78, 97, ..., 23, 93, 90],
[42, 2, 21, ..., 68, 45, 62],
[28, 45, 30, ..., 0, 75, 48],
...,
[26, 88, 78, ..., 0, 88, 43],
[91, 53, 94, ..., 85, 44, 37],
[39, 8, 10, ..., 46, 15, 67]])
In [63]: %time smallest(large_X, 5)
Wall time: 1min 32s
Out[63]:
array([[ 1676.12529365, 4815. , 5863. ],
[ 1692.97253374, 1628. , 2950. ],
[ 1693.558384 , 5742. , 8240. ],
[ 1695.86408654, 2140. , 6969. ],
[ 1696.68853948, 5477. , 6641. ]])
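For very large inputs, the full sort inside np.argsort dominates once pdist fits in memory; here is a hedged variant using np.argpartition, which only partially orders the array (my tweak, not part of the answer above):
import numpy as np
from scipy.spatial.distance import pdist

def smallest_partition(M, f):
    dst = pdist(M, 'euclidean')
    rows, cols = np.triu_indices(M.shape[0], k=1)
    # argpartition places the f smallest distances first without fully
    # sorting; we then order just those f by value
    part = np.argpartition(dst, f)[:f]
    idx = part[np.argsort(dst[part])]
    return np.vstack((dst[idx], rows[idx], cols[idx])).T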
I am trying to figure out how to vectorize the following for loop, which splits an array into groups based on the index of the lowest value in each row. I've looked at this link and have been trying to use the numpy.where function, but so far without success.
For example, if an array has n columns, then all the rows where col[0] has the lowest value go in one array, all the rows where col[1] has the lowest value go in another, and so on.
Here's the code using a for loop.
import numpy
a = numpy.array([[ 0. 1. 3.]
[ 0. 1. 3.]
[ 0. 1. 3.]
[ 1. 0. 2.]
[ 1. 0. 2.]
[ 1. 0. 2.]
[ 3. 1. 0.]
[ 3. 1. 0.]
[ 3. 1. 0.]])
result_0 = []
result_1 = []
result_2 = []
for value in a:
    if value[0] <= value[1] and value[0] <= value[2]:
        result_0.append(value)
    elif value[1] <= value[0] and value[1] <= value[2]:
        result_1.append(value)
    else:
        result_2.append(value)
print(result_0)
>>[array([ 0. 1. 3.]), array([ 0. 1. 3.]), array([ 0. 1. 3.])]
print(result_1)
>>[array([ 1. 0. 2.]), array([ 1. 0. 2.]), array([ 1. 0. 2.])]
print(result_2)
>>[array([ 3. 1. 0.]), array([ 3. 1. 0.]), array([ 3. 1. 0.])]
First, use argsort to see where the lowest value in each row is:
>>> a.argsort(axis=1)
array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2],
[1, 0, 2],
[1, 0, 2],
[1, 0, 2],
[2, 1, 0],
[2, 1, 0],
[2, 1, 0]])
Note that wherever the result holds a 0, that position marks the column containing the smallest value in that row.
Now you can build the results:
>>> sortidx = a.argsort(axis=1)
>>> [a[sortidx[:,i] == 0] for i in range(a.shape[1])]
[array([[ 0., 1., 3.],
[ 0., 1., 3.],
[ 0., 1., 3.]]),
array([[ 1., 0., 2.],
[ 1., 0., 2.],
[ 1., 0., 2.]]),
array([[ 3., 1., 0.],
[ 3., 1., 0.],
[ 3., 1., 0.]])]
So it is done with only a single loop over the columns, which will give a huge speedup if the number of rows is much larger than the number of columns.
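If only the location of the minimum matters, a.argmin(axis=1) avoids the full sort; a minimal sketch of the same grouping (my variant, not the answer's). Note that argmin breaks ties toward the lower column index, matching the <= chain in the question:
import numpy as np

# column index of each row's minimum
min_col = a.argmin(axis=1)
results = [a[min_col == i] for i in range(a.shape[1])]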
This is not the best solution, since it relies on plain Python loops and is not very efficient on large data sets, but it should get you started.
The idea is to create one results "bucket" per column, as many as the longest row. For each row of a, find the smallest value and its offset with enumerate and min, then append the row to the bucket for that offset. Finally, print everything out in the last loop.
Solution using loops:
import numpy
import pprint
# random data set
a = numpy.array([[0, 1, 3],
[0, 1, 3],
[0, 1, 3],
[1, 0, 2],
[1, 0, 2],
[1, 0, 2],
[3, 1, 0],
[3, 1, 0],
[3, 1, 0]])
# create one results bucket per column (as many as the longest entry)
results = list()
for l in range(max(len(i) for i in a)):
    results.append(list())
# don't do the following, because all the references would point to the
# same list and you would get duplicates:
# results = [[]] * max(len(i) for i in a)

for value in a:
    res_offset, _val = min(enumerate(value), key=lambda x: x[1])  # get the offset and min value
    results[res_offset].append(value)  # store the original array in the correct "bucket"

# print for visualization
for c, r in enumerate(results):
    print("result_%s: %s" % (c, r))
Outputs:
result_0: [array([0, 1, 3]), array([0, 1, 3]), array([0, 1, 3])]
result_1: [array([1, 0, 2]), array([1, 0, 2]), array([1, 0, 2])]
result_2: [array([3, 1, 0]), array([3, 1, 0]), array([3, 1, 0])]
I found a much easier way to do this. I hope that I am interpreting the OP correctly.
My sense is that the OP wants to create a slice of the larger array based upon some set of conditions.
Note that the code in the question that creates the array is not valid Python (at least not in Python 3.5), so I generated the array as follows:
a = np.array([0., 1., 3., 0., 1., 3., 0., 1., 3., 1., 0., 2., 1., 0., 2., 1., 0., 2., 3., 1., 0., 3., 1., 0., 3., 1., 0.]).reshape([9, 3])
Next, I sliced the original array into smaller arrays. Numpy has builtins to help with this.
result_0 = a[np.logical_and(a[:,0] <= a[:,1],a[:,0] <= a[:,2])]
result_1 = a[np.logical_and(a[:,1] <= a[:,0],a[:,1] <= a[:,2])]
result_2 = a[np.logical_and(a[:,2] <= a[:,0],a[:,2] <= a[:,1])]
This will generate new numpy arrays that match the given conditions. (One caveat: a row whose minimum value is tied across columns satisfies more than one condition and will appear in more than one result, whereas the elif chain in the question assigns each row to exactly one bucket.)
Note if the user wants to convert these individual rows into a list or arrays, he/she can just enter the following code to obtain the result.
result_0 = [np.array(x) for x in result_0.tolist()]
result_1 = [np.array(x) for x in result_1.tolist()]
result_2 = [np.array(x) for x in result_2.tolist()]
This should generate the outcome requested in the OP.
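As a side note, these masks are often written with the & operator, which is equivalent to np.logical_and for boolean arrays; a one-line sketch:
# parentheses are required because & binds more tightly than <=
result_0 = a[(a[:, 0] <= a[:, 1]) & (a[:, 0] <= a[:, 2])]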
How would I do the following: with a 3D numpy array, I want to take the mean along one dimension and assign the values back to an array of the same shape, with the means duplicated along the direction they were derived from.
I'm struggling to work out an example in 3D, but in 2D (4x4) it would look a bit like this, I guess:
array([[1, 1, 2, 2],
       [2, 2, 1, 0],
       [1, 1, 2, 2],
       [4, 8, 3, 0]])
becomes
array([[2, 3, 2, 1],
       [2, 3, 2, 1],
       [2, 3, 2, 1],
       [2, 3, 2, 1]])
I'm struggling with np.mean and the loss of a dimension when taking an average.
You can use the keepdims keyword argument to keep that vanishing dimension, e.g.:
>>> a = np.random.randint(10, size=(4, 4)).astype(np.double)
>>> a
array([[ 7., 9., 9., 7.],
[ 7., 1., 3., 4.],
[ 9., 5., 9., 0.],
[ 6., 9., 1., 5.]])
>>> a[:] = np.mean(a, axis=0, keepdims=True)
>>> a
array([[ 7.25, 6. , 5.5 , 4. ],
[ 7.25, 6. , 5.5 , 4. ],
[ 7.25, 6. , 5.5 , 4. ],
[ 7.25, 6. , 5.5 , 4. ]])
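Since the question is actually about a 3D array, the same in-place assignment works along any axis; a small sketch with an arbitrary shape of my choosing:
import numpy as np

vol = np.random.rand(3, 4, 5)
# the mean over axis 1 keeps shape (3, 1, 5); assigning through vol[:]
# broadcasts it back across the 4 entries of that axis
vol[:] = vol.mean(axis=1, keepdims=True)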
You can resize the array after taking the mean:
In [24]: a = np.array([[1, 1, 2, 2],
[2, 2, 1, 0],
[2, 3, 2, 1],
[4, 8, 3, 0]])
In [25]: np.resize(a.mean(axis=0).astype(int), a.shape)
Out[25]:
array([[2, 3, 2, 0],
[2, 3, 2, 0],
[2, 3, 2, 0],
[2, 3, 2, 0]])
In order to correctly satisfy the condition that duplicate values of the means appear in the direction they were derived, it's necessary to reshape the mean array to a shape which is broadcastable with the original array.
Specifically, the mean array should have the same shape as the original array except that the length of the dimension along which the mean was taken should be 1.
The following function should work for any shape of array and any number of dimensions:
def fill_mean(arr, axis):
    mean_arr = np.mean(arr, axis=axis)
    mean_shape = list(arr.shape)
    mean_shape[axis] = 1
    mean_arr = mean_arr.reshape(mean_shape)
    return np.zeros_like(arr) + mean_arr
Here's the function applied to your example array which I've called a:
>>> fill_mean(a, 0)
array([[ 2.25, 3.5 , 2. , 0.75],
[ 2.25, 3.5 , 2. , 0.75],
[ 2.25, 3.5 , 2. , 0.75],
[ 2.25, 3.5 , 2. , 0.75]])
>>> fill_mean(a, 1)
array([[ 1.5 , 1.5 , 1.5 , 1.5 ],
[ 1.25, 1.25, 1.25, 1.25],
[ 2. , 2. , 2. , 2. ],
[ 3.75, 3.75, 3.75, 3.75]])
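Assuming numpy 1.10 or later, np.broadcast_to gives the same filled result without materializing the zeros array; it returns a read-only view, hence the copy():
def fill_mean_view(arr, axis):
    # keepdims leaves a size-1 axis, so the result broadcasts over arr.shape
    mean_arr = np.mean(arr, axis=axis, keepdims=True)
    return np.broadcast_to(mean_arr, arr.shape).copy()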
Construct the numpy array
import numpy as np
data = np.array(
[[1, 1, 2, 2],
[2, 2, 1, 0],
[1, 1, 2, 2],
[4, 8, 3, 0]]
)
Use the axis parameter to get means along a particular axis
>>> means = np.mean(data, axis=0)
>>> means
array([ 2., 3., 2., 1.])
Now tile that resulting array into the shape of the original
>>> print(np.tile(means, (4, 1)))
[[ 2. 3. 2. 1.]
[ 2. 3. 2. 1.]
[ 2. 3. 2. 1.]
[ 2. 3. 2. 1.]]
You can replace the (4, 1) with values taken from data.shape, as sketched below.
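For instance, a shape-agnostic version of the tile call might look like:
>>> print(np.tile(means, (data.shape[0], 1)))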