I have a list of numpy arrays and want to remove duplicates and also keep the order of my sorted data. This is my array with duplicates:
dup_arr = [np.array([[0., 10., 10.],
                     [0., 2., 30.],
                     [0., 3., 5.],
                     [0., 3., 5.],
                     [0., 3., 40.]]),
           np.array([[0., -1., -4.],
                     [0., -2., -3.],
                     [0., -3., -5.],
                     [0., -3., -6.],
                     [0., -3., -6.]])]
I tried to do it using the following code:
clean_arr = []
for i in dup_arr:
    new_array = [tuple(row) for row in i]
    uniques = np.unique(new_array, axis=0)
    clean_arr.append(uniques)
But the problem with this method is that it changes the order of my data, and I do not want to sort it again because that is a tough task for my real data. I want the following result:
clean_arr = [np.array([[0., 10., 10.],
                       [0., 2., 30.],
                       [0., 3., 5.],
                       [0., 3., 40.]]),
             np.array([[0., -1., -4.],
                       [0., -2., -3.],
                       [0., -3., -5.],
                       [0., -3., -6.]])]
But the code shuffles it. I also tried the following for loops, but that was not successful either, because I cannot iterate to the end of my data and stop the inner loop before it reaches the end of each array in my list.
clean_arr = []
for arrays in dup_arr:
    for rows in range(len(arrays) - 1):
        if np.all(arrays[rows] == arrays[rows + 1]):
            continue
        else:
            dat = arrays[rows]
            clean_arr.append(dat)
In advance, I do appreciate any help and contribution.
You can simply use np.unique with axis=0. If you want to keep the order from the original sequence, try this -
[i[np.sort(np.unique(i, axis=0, return_index=True)[1])] for i in dup_arr]
[array([[ 0., 10., 10.],
[ 0., 2., 30.],
[ 0., 3., 5.],
[ 0., 3., 40.]]),
array([[ 0., -1., -4.],
[ 0., -2., -3.],
[ 0., -3., -5.],
[ 0., -3., -6.]])]
np.unique(i, axis=0, return_index=True)[1] returns the indices of the unique rows.
np.sort() sorts these indices back into their original order in the array.
[f(i) for i in dup_arr] applies the above 2 steps to each element of dup_arr.
NOTE: You will NOT be able to completely vectorize this operation (say, by np.stack-ing the results), since a variable number of duplicates may be removed from each matrix. That would leave the numpy arrays with unequal shapes along an axis.
Breaking the steps into a function -
def f(a):
    indexes = np.unique(a, axis=0, return_index=True)[1]
    return a[np.sort(indexes)]

[f(i) for i in dup_arr]
In my humble opinion, every time you want to remove duplicates from an array or list in Python, you should consider using a set.
Also, try to avoid using multiple nested loops since errors easily occur and they're hard to find. I suggest you give the following code a try:
removed_duplicates = []
for subarr in dup_arr:
    removed_duplicates.append(np.array([list(item) for item in set(tuple(row) for row in subarr)]))
Basically what's happening is that you convert each row of your array to a tuple, then to a set, which removes all duplicates, and then to a list. Since your original data was a list of np.arrays, you convert the list back to an np.array before you append it to the new list.
Would this work?
I am trying to replace/overwrite values in an array using the following commands:
import numpy as np
test = np.array([[4,5,0],[0,0,0],[0,0,6]])
test
Out[20]:
array([[4., 5., 0.],
[0., 0., 0.],
[0., 0., 6.]])
test[np.where(test[...,0] != 0)][...,1:3] = np.array([[10,11]])
test
Out[22]:
array([[4., 5., 0.],
[0., 0., 0.],
[0., 0., 6.]])
However, as one can see in Out[22], the array test has not been modified. So I am concluding that it is not possible to simply overwrite part of an array or just a few cells.
Nevertheless, in other contexts it is possible to overwrite a few cells of an array. For example, in the code below:
test = np.array([[1,2,0],[0,0,0],[0,0,3]])
test
Out[11]:
array([[1., 2., 0.],
[0., 0., 0.],
[0., 0., 3.]])
test[test>0]
Out[12]:
array([1., 2., 3.])
test[test>0] = np.array([4,5,6])
test
Out[14]:
array([[4., 5., 0.],
[0., 0., 0.],
[0., 0., 6.]])
Therefore, my 2 questions:
1- Why does the first command
test[np.where(test[...,0] != 0)][...,1:3] = np.array([10,11])
not modify the array test? Why does it not allow accessing the array cells and overwriting them?
2- How could I make it work considering that for my code I would need to select the cells using the command above?
Many thanks!
I'll do you one up. This does work:
test[...,1:3][np.where(test[...,0] != 0)] = np.array([[10,11]])
array([[ 4, 10, 11],
[ 0, 0, 0],
[ 0, 0, 6]])
Why? It's the combination of two factors - numpy indexing and .__setitem__ calls.
The python interpreter sort of reads lines backwards. When it gets to =, it tries to call .__setitem__ on whatever is furthest to the left. __setitem__ is (hopefully) a method of that object, and takes two inputs: the indices (whatever is between the [ ] just before the =) and the value being assigned.
a[b] = c  # is interpreted as
a.__setitem__(b, c)
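A quick way to see this translation in action is a toy class that records its __setitem__ calls (Probe is just an illustrative name, not part of numpy):

```python
class Probe:
    """Toy class that records how Python translates `obj[key] = value`."""
    def __init__(self):
        self.calls = []

    def __setitem__(self, key, value):
        self.calls.append((key, value))

p = Probe()
p[3] = "x"       # becomes p.__setitem__(3, "x")
p[1:3] = "y"     # the key arrives as a slice object: slice(1, 3, None)
print(p.calls)   # [(3, 'x'), (slice(1, 3, None), 'y')]
```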
Now, when we index in numpy we have three basic ways we can do it.
slicing (returns views)
'advanced indexing' (returns copies)
'simple indexing' (also returns copies)
One major difference between "advanced" and "simple" indexing is that a numpy array's __setitem__ function can interpret advanced indexes. And views mean the data addresses are the same, so we don't need __setitem__ to get to them.
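For illustration, np.shares_memory makes the view/copy distinction easy to check:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# Basic slicing returns a view: it shares memory with `a`,
# so writing through it modifies `a` directly.
v = a[:, 1:3]
print(np.shares_memory(a, v))   # True

# Advanced (integer-array) indexing returns a copy.
c = a[np.array([0, 1])]
print(np.shares_memory(a, c))   # False
```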
So:
test[np.where(test[...,0] != 0)][...,1:3] = np.array([[10,11]])  # is interpreted as
(test[np.where(test[...,0] != 0)]).__setitem__((..., slice(1, 3)),
                                               np.array([[10,11]]))
But, since np.where(test[...,0] != 0) is an advanced index, (test[np.where(test[...,0] != 0)]) returns a copy, which is then lost because it is never assigned. It does set the elements we want to [10, 11], but the result is discarded in the buffer somewhere.
If we do:
test[..., 1:3][np.where(test[..., 0] != 0)] = np.array([[10, 11]])  # is interpreted as
(test[..., 1:3]).__setitem__( np.where(test[...,0] != 0), np.array([[10,11]]) )
test[...,1:3] is a view, so it still points to the same memory. Now __setitem__ looks for the locations in test[...,1:3] that correspond to np.where(test[...,0] != 0), and sets them equal to np.array([[10,11]]). And everything works.
You can also do this:
test[np.where(test[...,0] != 0), 1:3] = np.array([10, 11])
Now, since all the indexing is in one set of brackets, it's calling test.__setitem__ on those indices, which sets the data correctly as well.
Even simpler (and most pythonic) would be:
test[test[...,0] != 0, 1:3] = np.array([10,11])
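A quick end-to-end check of that boolean-mask form on the array from the question:

```python
import numpy as np

test = np.array([[4, 5, 0], [0, 0, 0], [0, 0, 6]])

# One boolean mask for the rows plus a column slice, all inside a single
# pair of brackets, so numpy's __setitem__ writes into `test` in place.
test[test[:, 0] != 0, 1:3] = np.array([10, 11])
print(test)
# [[ 4 10 11]
#  [ 0  0  0]
#  [ 0  0  6]]
```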
Python 3.x
I have a for loop that makes some calculations and creates one slice/2D array, say (x = 3, y = 3), per iteration, and in the same for loop I want to append/stack them along a third dimension.
I have been trying with Numpy stack, vstack, hstack, dstack but I still don't get how to get them together in the 3rd dimension as I want.
So I would like to have at them end something like this:
(z = 10, x = 3, y = 3)
array([[[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]],

       [[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]],
       ...
      ])
Thanks,
You can do it like this:
arrays = []
for i in range(5):
    arr = np.full((3, 3), i)
    arrays.append(arr)

np.asarray(arrays)
If you want, you can call np.asarray(arrays) inside the loop, but that will not be very efficient. Note that np.concatenate also effectively creates a new numpy array, so the efficiency will be similar. Doing the operation once, outside the loop, is better.
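For illustration, the same result with np.stack, which the question mentions; on a list of equal-shape arrays it behaves like np.asarray here, adding a new leading axis:

```python
import numpy as np

# Build ten 3x3 slices and stack them along a new leading axis.
arrays = [np.full((3, 3), i, dtype=float) for i in range(10)]
stacked = np.stack(arrays)   # equivalent to np.asarray(arrays) here
print(stacked.shape)         # (10, 3, 3)
print(stacked[2])            # the slice filled with 2.0
```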
I am trying to create a lat/lon grid that contains, for each lat/lon combination, an array of the indices where two conditions are met. This approach might be too complicated, but using a meshgrid or numpy broadcasting also failed. If there is a better approach, feel free to share your knowledge. :-)
Round the lat/lon values to a grid resolution of 1° but retain the full length of the array:
x = np.around(lon, decimals=0)
y = np.around(lat, decimals=0)
The arrays consist of longitude/latitude values from -180° to 180° and -82° to 82°; multiple doublets are possible.
Check for each combination of lat/lon how many measurements are available at each 1°/1° grid point:
a = np.arange(-180, 181)
b = np.arange(-82, 83)
totalgrid = [[0 for i in range(len(b))] for j in range(len(a))]

for d1 in range(len(a)):
    for d2 in range(len(b)):
        totalgrid[d1][d2] = np.where((x == a[d1]) & (y == b[d2]))[0]
This method fails and returns only a list of lists with empty arrays. I can't figure out why it's not working properly.
Replacing the last line by:
totalgrid[d1][d2]=np.where((x==a[0])&(y==b[0]))[0]
returns all found indices from lon/lat that are present at -180°/-82°. Unfortunately it takes a while. Am I missing a for loop somewhere?!
The Problem in more detail:
@askewchan
Unfortunately this one does not solve my original problem.
As expected the result represents the groundtrack quite well.
But besides the fact that I need the total number of points for each grid point, I also need each single index of lat/lon combinations in the lat/lon array for further computations.
Let's assume I have arrays
lat(100000L,), lon(100000L,) and a third array(100000L,)
which corresponds to the measurement at each point. I need every index of all 1°/1° combinations in lat/lon, so I can check that index in the array(100000L,) to see whether a condition is met. Now let's assume that the indices [10000, 10001, 10002, ..., 10025] of lat/lon fall on the same grid point. For those indices I need to check whether array[10000, 10001, 10002, ..., 10025] meets a condition, i.e. np.where(array==0). With cts.nonzero() I only get the index in the histogram, but then all information about each point contributing to the value of the histogram is lost. Hopefully you get what my initial problem was.
Not sure if I understand the goal here, but you want to count how many lat/lon pairs you have in each 1° section? This is what a histogram does:
lon = np.random.random(5000)*2*180 - 180
lat = np.random.random(5000)*2*82 - 82
a = np.arange(-180,181)
b = np.arange(-82,83)
np.histogram2d(lon, lat, (a,b))
#(array([[ 0., 0., 1., ..., 0., 0., 0.],
# [ 0., 2., 0., ..., 0., 0., 1.],
# [ 0., 0., 0., ..., 0., 1., 0.],
# ...,
# [ 0., 1., 0., ..., 0., 0., 0.],
# [ 0., 0., 0., ..., 0., 0., 0.],
# [ 0., 0., 0., ..., 0., 0., 0.]]),
The indices where you have a nonzero count would be at:
cts.nonzero()
#(array([ 0, 0, 0, ..., 359, 359, 359]),
# array([ 2, 23, 25, ..., 126, 140, 155]))
You can plot it too:
cts, xs, ys = np.histogram2d(lon, lat, (a,b))
pyplot.imshow(cts, extent=(-82,82,-180,180))
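Putting the pieces together with random data (just for illustration; the exact counts per cell will vary with the data):

```python
import numpy as np

rng = np.random.default_rng(0)
lon = rng.random(5000) * 2 * 180 - 180
lat = rng.random(5000) * 2 * 82 - 82

a = np.arange(-180, 181)   # 361 edges -> 360 one-degree bins
b = np.arange(-82, 83)     # 165 edges -> 164 one-degree bins

# cts[i, j] counts the points falling in the cell [a[i], a[i+1]) x [b[j], b[j+1])
cts, xs, ys = np.histogram2d(lon, lat, (a, b))
print(cts.shape)        # (360, 164)
print(int(cts.sum()))   # 5000 -- every point lands in exactly one cell
```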
I need to calculate n points (3D) with equal spacing along a defined line (3D).
I know the starting and end point of the line. First, I used
for k in range(nbin):
    step = k / float(nbin - 1)
    bin_point.append(beam_entry + (step * (beamlet_intersection - beam_entry)))
Then I found that using append for large arrays takes more time, so I changed the code to this:
bin_point = [start_point+((k/float(nbin-1))*(end_point-start_point)) for k in range(nbin)]
I got a suggestion that using newaxis will further improve the time.
The modified code looks like this.
step = arange(nbin) / float(nbin - 1)
bin_point = start_point + (step[:, newaxis, newaxis] * (end_point - start_point)[newaxis, :, :])
But I could not understand the newaxis function, and I also doubt whether the same code will work if the structure or shape of start_point and end_point changes. Similarly, how can I use newaxis to modify the following code?
for j in range(32):  # for all los
    line_dist[j] = sqrt([sum(l) for l in (end_point[j] - start_point[j])**2])
Sorry for being so clunky, to be more clear the structure of the start_point and end_point are
array([ [[1,1,1],[],[],[]....[]],
[[],[],[],[]....[]],
[[],[],[],[]....[]]......,
[[],[],[],[]....[]] ])
Explanation of the newaxis version in the question: these are not matrix multiplies; ndarray multiplication is element-by-element multiplication with broadcasting. step[:,newaxis,newaxis] is num_steps x 1 x 1 and point[newaxis,:,:] is 1 x num_points x num_dimensions. Broadcasting ndarrays with shapes (num_steps x 1 x 1) and (1 x num_points x num_dimensions) together works, because the broadcasting rules require every dimension to be either 1 or the same; a dimension of 1 just means "repeat the array along that dimension as many times as the corresponding dimension of the other array". This produces an ndarray with shape (num_steps x num_points x num_dimensions) very efficiently; the i, j, k entry is the k-th coordinate of the i-th step along the j-th line (given by the j-th pair of start and end points).
Walkthrough:
>>> start_points = numpy.array([[1, 0, 0], [0, 1, 0]])
>>> end_points = numpy.array([[10, 0, 0], [0, 10, 0]])
>>> steps = numpy.arange(10)/9.0
>>> start_points.shape
(2, 3)
>>> steps.shape
(10,)
>>> steps[:,numpy.newaxis,numpy.newaxis].shape
(10, 1, 1)
>>> (steps[:,numpy.newaxis,numpy.newaxis] * start_points).shape
(10, 2, 3)
>>> (steps[:,numpy.newaxis,numpy.newaxis] * (end_points - start_points)) + start_points
array([[[ 1., 0., 0.],
[ 0., 1., 0.]],
[[ 2., 0., 0.],
[ 0., 2., 0.]],
[[ 3., 0., 0.],
[ 0., 3., 0.]],
[[ 4., 0., 0.],
[ 0., 4., 0.]],
[[ 5., 0., 0.],
[ 0., 5., 0.]],
[[ 6., 0., 0.],
[ 0., 6., 0.]],
[[ 7., 0., 0.],
[ 0., 7., 0.]],
[[ 8., 0., 0.],
[ 0., 8., 0.]],
[[ 9., 0., 0.],
[ 0., 9., 0.]],
[[ 10., 0., 0.],
[ 0., 10., 0.]]])
As you can see, this produces the correct answer :) In this case broadcasting (10,1,1) and (2,3) results in (10,2,3). What you had is broadcasting (10,1,1) and (1,2,3) which is exactly the same and also produces (10,2,3).
The code for the distance part of the question does not need newaxis: the inputs are num_points x num_dimensions, the output is num_points, so one dimension has to be removed. That is exactly the axis you sum along. This should work:
line_dist = numpy.sqrt(numpy.sum((end_point - start_point) ** 2, axis=1))
Here numpy.sum(..., axis=1) means sum along that axis only, rather than over all elements: an ndarray with shape num_points x num_dimensions summed along axis=1 produces a result with shape (num_points,), which is correct.
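A small worked example of this, with two start/end pairs chosen so the distances are easy to verify by hand:

```python
import numpy as np

# Two (start, end) pairs in 3D; the distances should be 5 and 3.
start_point = np.array([[0., 0., 0.], [1., 1., 1.]])
end_point   = np.array([[3., 4., 0.], [1., 1., 4.]])

# Sum the squared differences along axis=1 (the coordinate axis),
# leaving one value per point pair.
line_dist = np.sqrt(np.sum((end_point - start_point) ** 2, axis=1))
print(line_dist)   # [5. 3.]
```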
I haven't fully understood everything you wrote, but some things I can already tell you; maybe they help.
newaxis is a marker rather than a function (in fact, it is plain None). It is used to add an (unused) dimension to a multi-dimensional value. With it you can make a 3D value out of a 2D value (or even more). Each dimension already present in the input value must be represented by a colon : in the index (assuming you want to use all values; otherwise it gets complicated beyond our use case), and the dimensions to be added are denoted by newaxis.
Example:
input is a one-dimensional vector (1D): 1,2,3
output shall be a matrix (2D).
There are two ways to accomplish this; the vector could fill the rows with one value each, or the vector could fill just the first and only row of the matrix. The first is created by vector[:,newaxis], the second by vector[newaxis,:]. The results:
>>> array([ 7,8,9 ])[:,newaxis]
array([[7],
[8],
[9]])
>>> array([ 7,8,9 ])[newaxis,:]
array([[7, 8, 9]])
(Dimensions of multi-dimensional values are represented by nesting of arrays of course.)
If you have more dimensions in the input, use the colon more than once (otherwise the deeper nested dimensions are simply ignored, i.e. the arrays are treated as simple values). I won't paste a representation of this here as it won't clarify things due to the optical complexity when 3D and 4D values are written on a 2D display using nested brackets. I hope it gets clear anyway.
The newaxis reshapes the array in such a way so that when you multiply numpy uses broadcasting. Here is a good tutorial on broadcasting.
step[:, newaxis, newaxis] is the same as step.reshape((step.shape[0], 1, 1)) (if step is 1d). Either method of reshaping should be very fast, because reshaping arrays in numpy is very cheap; it just makes a view of the array, especially because you should only be doing it once.
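For illustration, both spellings give the same shape, the same values, and a view of the original data rather than a copy:

```python
import numpy as np

step = np.arange(4)

a = step[:, np.newaxis, np.newaxis]
b = step.reshape((step.shape[0], 1, 1))

print(a.shape)                 # (4, 1, 1)
print(np.array_equal(a, b))    # True

# Both are views sharing memory with `step`, not copies:
print(np.shares_memory(a, step), np.shares_memory(b, step))   # True True
```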