What is the most efficient way to remove items from a list based on a function in python (and any common library)?
For example, if I have the following function:
def func(a):
    return a % 2 == 1
And the following list:
arr = [1,4,5,8,20,24]
Then I would want the result:
new_arr = [1,5]
I know I could simply iterate over the list like such:
new_arr = [i for i in arr if func(i)]
Just wondering whether this is an efficient approach (for large datasets), or if there might be a better way. I was thinking of maybe using NumPy to map the function over the array, changing it to return the element if True and -1 if False, and then using NumPy to remove all the -1s?
This is the use case for the builtin filter function:
filtered_arr = filter(func, arr)
Note that this returns an iterator, not a list. If you want a list, you can create one with list(filtered_arr) or a list comprehension as you noted. But if you just want to iterate over the filtered items and don't need random access / indexing, it's more memory efficient to use the iterator.
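For example, with the function and list from the question (a quick illustrative check, not a benchmark):

def func(a):
    return a % 2 == 1

arr = [1, 4, 5, 8, 20, 24]

filtered_arr = filter(func, arr)   # lazy iterator, nothing has been computed yet
print(list(filtered_arr))          # [1, 5]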
This is a good general approach for filtering lists that are not especially large and contain elements with arbitrary (and possibly mixed) types. If you are working with a large amount of numerical data, you should use one of the NumPy solutions mentioned in other answers.
Since the numpy tag is present, let's use it. In this example we build a boolean mask selecting the elements that give remainder 1 when divided by 2.
>>> import numpy as np
>>>
>>> arr = np.array([1,4,5,8,20,24])
>>> arr
array([ 1, 4, 5, 8, 20, 24])
>>> arr[arr % 2 == 1]
array([1, 5])
Using numpy you could use numpy.vectorize to map your function across your array elements, then use that to keep the elements that evaluate to True. So starting with your function
def func(a):
    return a % 2 == 1
We could test keeping only the odd values from the range of [0, 19]
>>> import numpy as np
>>> arr = np.arange(20)
>>> arr
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
>>> np.vectorize(func)(arr)
array([False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True])
>>> arr[np.vectorize(func)(arr)]
array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19])
You could also rewrite func to make it take a list as an argument:
def func(lst):
    lst = [x for x in lst if x % 2 == 1]
    return lst
and then do:
new_arr = func(arr)
This would save you some lines compared to making func take a single number as an argument and writing a comprehension such as [i for i in arr if func(i)] every time you want to use it on (elements of) a list.
I decided to test it myself with the suggestions you all gave (I probably just should have tested runtime myself rather than being lazy and asking other people).
filter was by far the fastest alone. Though if you needed a list for random access it was comparable to the [i for i in arr if func(i)] method.
The numpy [np.vectorize(func)(arr)] was slightly faster than the other methods.
Just wanted to post this in case anyone else runs across this in the future.
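For anyone who wants to reproduce a comparison like this, here is a rough timeit sketch; the list size and repeat count are arbitrary, and the relative timings will depend on your machine and data:

import timeit

import numpy as np

def func(a):
    return a % 2 == 1

arr = list(range(100000))
np_arr = np.array(arr)

print(timeit.timeit(lambda: [i for i in arr if func(i)], number=20))          # list comprehension
print(timeit.timeit(lambda: list(filter(func, arr)), number=20))              # filter + list
print(timeit.timeit(lambda: np_arr[np.vectorize(func)(np_arr)], number=20))   # np.vectorize mask
print(timeit.timeit(lambda: np_arr[np_arr % 2 == 1], number=20))              # pure NumPy mask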
I have the following problem: I have index arrays with repeating indices and would like to add values to an array like this:
grid_array[xidx[:],yidx[:],zidx[:]] += data[:]
However, as I have repeated indices this does not work as it should: numpy buffers the operation in a temporary array, so the data for the repeated indices overwrite each other instead of being added together (see http://docs.scipy.org/doc/numpy/user/basics.indexing.html).
A for loop like
for i in range(0, n):
    grid_array[xidx[i], yidx[i], zidx[i]] += data[i]
will be way too slow. Is there a way I can still use numpy's vectorization? Or is there another way to make this assignment faster?
Thanks for your help
How about using bincount?
import numpy as np
flat_index = np.ravel_multi_index([xidx, yidx, zidx], grid_array.shape)
datasum = np.bincount(flat_index, data, minlength=grid_array.size)
grid_array += datasum.reshape(grid_array.shape)
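A small self-contained check of this approach (the grid shape, indices, and data below are made up for illustration; note that the first two index triples repeat):

import numpy as np

grid_array = np.zeros((2, 2, 2))
xidx = np.array([0, 0, 1])
yidx = np.array([1, 1, 0])
zidx = np.array([0, 0, 1])
data = np.array([1.0, 2.0, 5.0])

flat_index = np.ravel_multi_index([xidx, yidx, zidx], grid_array.shape)
datasum = np.bincount(flat_index, data, minlength=grid_array.size)
grid_array += datasum.reshape(grid_array.shape)

print(grid_array[0, 1, 0])   # 3.0: the two contributions to the repeated index were summed
print(grid_array[1, 0, 1])   # 5.0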
This is a buffering issue. The ufunc .at method provides an unbuffered operation:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.at.html#numpy.ufunc.at
np.add.at(grid_array, (xidx, yidx, zidx), data)
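A short sketch contrasting the buffered fancy-indexing assignment with np.add.at (the indices and data are made up; the triple (0, 1, 0) appears twice):

import numpy as np

xidx = np.array([0, 0, 1])
yidx = np.array([1, 1, 0])
zidx = np.array([0, 0, 1])
data = np.array([1.0, 2.0, 5.0])

# buffered fancy indexing: the repeated index is overwritten, not accumulated
buffered = np.zeros((2, 2, 2))
buffered[xidx, yidx, zidx] += data
print(buffered[0, 1, 0])     # 2.0: only one contribution survives

# unbuffered: contributions to the repeated index are summed
grid_array = np.zeros((2, 2, 2))
np.add.at(grid_array, (xidx, yidx, zidx), data)
print(grid_array[0, 1, 0])   # 3.0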
To add an array to the elements of a nested array you can simply do grid_array[::] += data:
>>> grid_array=np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> data=np.array([3,3,3])
>>> grid_array[::]+=data
>>> grid_array
array([[ 4, 5, 6],
[ 7, 8, 9],
[10, 11, 12]])
I think I found a possible solution:
def assign(xidx, yidx, zidx, data):
    grid_array[xidx, yidx, zidx] += data
    return
list(map(assign, xidx, yidx, zidx, sn.part0['mass']))  # list() forces evaluation; in Python 3 map is lazy
I want to replace some elements of a list immediately.
Suppose we have these lists:
L = [1, 2, 3, 4, 5, 6]
idx = [1, 3, 4]
new = [100, 200, 300]
I want to replace the elements at indices 1, 3, 4 of L with the new values, for example:
L[idx] = new
so that the final list is => [1, 100, 3, 200, 300, 6]
I know you can do it this way in Matlab, but I want to know what I should do in Python.
Note: I know it's possible to do this with loops.
Edit: I want to use a pure Python solution.
The "pythonic" way would be to use zip:
for i, n in zip(idx, new):
    L[i] = n
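With the lists from the question this gives the expected result:

L = [1, 2, 3, 4, 5, 6]
idx = [1, 3, 4]
new = [100, 200, 300]

for i, n in zip(idx, new):
    L[i] = n

print(L)   # [1, 100, 3, 200, 300, 6]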
Python itself doesn't support matlab-style array operations, but you can look into numpy if you're interested in that coding style (see #abarnert's answer).
L = [1, 2, 3, 4, 5, 6]
idx = [1, 3, 4]
new = [100, 200, 300]
for i in range(len(idx)):
    L[idx[i]] = new[i]
A slightly slower version without loops:
L = [1, 2, 3, 4, 5, 6]
idx = [1, 3, 4]
new = [100, 200, 300]
L = [num if i not in idx else new[idx.index(i)] for i, num in enumerate(L)]
Since you're looking for a Matlab-like solution, there's a good chance you should really be using NumPy here. In fact, if you do things that way, you can write exactly the code you wanted:
>>> import numpy as np
>>> a = np.array([1, 2, 3, 4, 5, 6])
>>> idx = [1, 3, 4]
>>> new = [100, 200, 300]
>>> a[idx] = new
>>> a
array([ 1, 100, 3, 200, 300, 6])
Besides giving you Matlab-ish element-wise operators and functions, NumPy also gives you convenient multi-dimensional arrays, and access to a huge library of higher-level functions (especially if you include adjunct libraries like SciPy). Plus you typically get performance benefits like, e.g., 6x speed and .25x space.
If you want a pure-Python solution, it's not that hard to implement this much of NumPy (or as much as you need) in Python. You can write your own Array class that emulates a container type in any way you want. In particular, note that a[idx] = new calls a.__setitem__(idx, new). You probably want to handle single numbers and slices the same way as list, but there's nothing at all stopping you from handling other types that list rejects. For example:
def __setitem__(self, idx, value):
    if isinstance(idx, collections.abc.Iterable):
        for i, v in zip(idx, value):
            self.lst[i] = v
    else:
        self.lst[idx] = value
(You'd probably want to add a bit of error-handling for the case where idx and value have different lengths. You could work out the best rules from first principles, or start by looking at what NumPy does and just decide what you do and don't want to copy…)
Of course it's not an accident that the guts of this implementation will be code very much like alexis's answer, because all we're doing is wrapping up that logic so you only have to write it once, instead of every time you need it.
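A minimal self-contained sketch of such a wrapper (the class name Array is made up for illustration; the __setitem__ body is the same logic as above):

import collections.abc

class Array:
    def __init__(self, lst):
        self.lst = list(lst)

    def __setitem__(self, idx, value):
        # accept an iterable of indices, like NumPy fancy-index assignment
        if isinstance(idx, collections.abc.Iterable):
            for i, v in zip(idx, value):
                self.lst[i] = v
        else:
            self.lst[idx] = value

    def __repr__(self):
        return repr(self.lst)

a = Array([1, 2, 3, 4, 5, 6])
a[[1, 3, 4]] = [100, 200, 300]
print(a)   # [1, 100, 3, 200, 300, 6]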
Is it possible to modify the numpy.random.choice function in order to make it return the index of the chosen element?
Basically, I want to create a list and select elements randomly without replacement
>>> import numpy as np
>>> a = [1, 4, 1, 3, 3, 2, 1, 4]
>>> np.random.choice(a)
4
>>> a
[1, 4, 1, 3, 3, 2, 1, 4]
a.remove(np.random.choice(a)) will remove the first element of the list with that value it encounters (a[1] in the example above), which may not be the chosen element (e.g., a[7]).
Regarding your first question, you can work the other way around: randomly choose from the indices of the array a and then fetch the value.
>>> a = [1,4,1,3,3,2,1,4]
>>> a = np.array(a)
>>> np.random.choice(np.arange(a.size))
6
>>> a[6]
1
But if you just need a random sample without replacement, replace=False will do. I can't remember when it was first added to random.choice, possibly 1.7.0, so if you are running a very old numpy it may not work. Keep in mind that the default is replace=True.
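For example, to draw several distinct indices in one call (the printed values will vary from run to run):

import numpy as np

a = np.array([1, 4, 1, 3, 3, 2, 1, 4])
idx = np.random.choice(a.size, size=3, replace=False)
print(idx, a[idx])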
Here's one way to find out the index of a randomly selected element:
import random # plain random module, not numpy's
random.choice(list(enumerate(a)))[0]
=> 4 # just an example, index is 4
Or you could retrieve the element and the index in a single step:
random.choice(list(enumerate(a)))
=> (1, 4) # just an example, index is 1 and element is 4
numpy.random.choice(a, size=however_many, replace=False)
If you want a sample without replacement, just ask numpy to make you one. Don't loop and draw items repeatedly. That'll produce bloated code and horrible performance.
Example:
>>> a = numpy.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> numpy.random.choice(a, size=5, replace=False)
array([7, 5, 8, 6, 2])
On a sufficiently recent NumPy (at least 1.17), you should use the new randomness API, which fixes a longstanding performance issue where the old API's replace=False code path unnecessarily generated a complete permutation of the input under the hood:
rng = numpy.random.default_rng()
result = rng.choice(a, size=however_many, replace=False)
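A complete runnable example along those lines (the seed is only there to make the sketch reproducible; the exact output depends on it):

import numpy as np

rng = np.random.default_rng(seed=42)
a = np.arange(10)
result = rng.choice(a, size=5, replace=False)
print(result)   # five distinct values drawn from 0-9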
This is a bit out of left field compared with the other answers, but I thought it might help with what it sounds like you're trying to do in a slightly larger sense. You can generate a random sample without replacement by shuffling the indices of the elements in the source array:
source = np.random.randint(0, 100, size=100) # generate a set to sample from
idx = np.arange(len(source))
np.random.shuffle(idx)
subsample = source[idx[:10]]
This will create a sample (here, of size 10) by drawing elements from the source set (here, of size 100) without replacement.
You can interact with the non-selected elements by using the remaining index values, i.e.:
notsampled = source[idx[10:]]
Maybe late, but it's worth mentioning this solution because I think the simplest way to do it is:
a = [1, 4, 1, 3, 3, 2, 1, 4]
n = len(a)
idx = np.random.choice(list(range(n)), p=np.ones(n)/n)
It means you are choosing from the indices uniformly. In a more general case, you can do a weighted sampling (and return the index) in this way:
probs = [.3, .4, .2, 0, .1]
n = len(a)
idx = np.random.choice(list(range(n)), p=probs)
If you repeat this many times (e.g. 1e5 draws), the histogram of the chosen indices would be close to [0.30126, 0.39817, 0.19986, 0., 0.10071] in this case, which is correct.
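A quick way to check those empirical frequencies yourself (the counts fluctuate slightly between runs):

import numpy as np

probs = [.3, .4, .2, 0, .1]
n_draws = 100000
draws = np.random.choice(len(probs), size=n_draws, p=probs)
print(np.bincount(draws, minlength=len(probs)) / n_draws)
# roughly [0.3, 0.4, 0.2, 0.0, 0.1]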
Anyway, you should choose from the indices and use the values (if you need) as their probabilities.
Instead of using choice, you can also simply random.shuffle your array, i.e.
random.shuffle(a) # will shuffle a in-place
Based on your comment:
The sample is already a. I want to work directly with a so that I can control how many elements are still left and perform other operations with a. – HappyPy
it sounds to me like you're interested in working with a after n randomly selected elements are removed. Instead, why not work with N = len(a) - n randomly selected elements from a? Since you want them to still be in the original order, you can select from indices like in #CTZhu's answer, but then sort them and grab from the original list:
import numpy as np
n = 3 #number to 'remove'
a = np.array([1,4,1,3,3,2,1,4])
i = np.random.choice(np.arange(a.size), a.size-n, replace=False)
i.sort()
a[i]
#array([1, 4, 1, 3, 1])
So now you can save that as a again:
a = a[i]
and work with a with n elements removed.
Here is a simple solution, just choose from the range function.
import numpy as np
a = [100,400,100,300,300,200,100,400]
I = np.random.choice(np.arange(len(a)))
print('index is ' + str(I) + ' number is ' + str(a[I]))
The question title and its description ask slightly different things. I just wanted the answer to the title question, which was getting only an (integer) index from numpy.random.choice(). Rather than any of the above, I settled on index = numpy.random.choice(len(array_or_whatever)) (tested on numpy 1.21.6).
Ex:
import numpy
a = [1, 2, 3, 4]
i = numpy.random.choice(len(a))
The problem I had with the other solutions was the unnecessary conversion to a list, which recreates the entire collection in a new object (slow!).
Reference: https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html?highlight=choice#numpy.random.choice
Key point from the docs about the first parameter a:
a: 1-D array-like or int
If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if it were np.arange(a)
Since the question is very old, it's possible I'm just coming at this with the convenience of newer versions supporting exactly what the OP and I wanted.
This question might go closer to pattern matching in image processing.
Is there any way to get a cost function value, applied on different lists, which will return the inter-list proximity? For example,
a = [4, 7, 9]
b = [5, 8, 10]
c = [2, 3]
Now the cost function value, maybe as a 2-tuple, for (a, b) should be more than for (a, c) and (b, c). This can be a huge computational task, since there can be many more lists and all permutations would blow up the complexity of the problem. So computing it only for the set of 2-tuples would work as well.
EDIT:
The list names indicate the type of actions, and elements in them are the time at which corresponding actions occur. What I'm trying to do is to come up with set(s) of actions which have similar occurrence pattern. Since two actions cannot occur at the same time, it's the combination of intra- and inter-list distance.
Thanks in advance!
You're asking a very difficult question. Without allowing the sizes to change, there are already several distance measures you could use (Euclidean, Manhattan, etc.). The one you need depends on what you think a good measure of the proximity is for whatever these lists represent.
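For equal-length lists, for instance, Euclidean distance is a one-liner with NumPy (just an illustration of one possible measure):

import numpy as np

a = [4, 7, 9]
b = [5, 8, 10]
print(np.linalg.norm(np.array(a) - np.array(b)))   # 1.732..., i.e. sqrt(3)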
Without knowing what you're trying to do with these lists no-one can define what a good answer would be, let alone how to compute it efficiently.
For comparing two strings or lists you can use the Levenshtein distance (a plain-Python implementation follows):
def levenshtein(s1, s2):
    # dynamic-programming edit distance; matrix[zz][sz] holds the distance
    # between the first zz items of s2 and the first sz items of s1
    l1 = len(s1)
    l2 = len(s2)
    matrix = [list(range(zz, zz + l1 + 1)) for zz in range(l2 + 1)]
    for zz in range(l2):
        for sz in range(l1):
            if s1[sz] == s2[zz]:
                matrix[zz + 1][sz + 1] = min(matrix[zz + 1][sz] + 1,
                                             matrix[zz][sz + 1] + 1,
                                             matrix[zz][sz])
            else:
                matrix[zz + 1][sz + 1] = min(matrix[zz + 1][sz] + 1,
                                             matrix[zz][sz + 1] + 1,
                                             matrix[zz][sz] + 1)
    return matrix[l2][l1]
Using that on your lists:
>>> a = [4, 7, 9]
>>> b = [5, 8, 10]
>>> c = [2, 3]
>>> levenshtein(a,b)
3
>>> levenshtein(b,c)
3
>>> levenshtein(a,c)
3
EDIT: with the added explanation in the comments, you could use sets instead of lists. Since every element of a set is unique, adding an existing element again is a no-op. And you can use the set's isdisjoint method to check that two sets do not contain the same elements, or the intersection method to see which elements they have in common:
In [1]: a = {1,3,5}
In [2]: a.add(3)
In [3]: a
Out[3]: set([1, 3, 5])
In [4]: a.add(4)
In [5]: a
Out[5]: set([1, 3, 4, 5])
In [6]: b = {2,3,7}
In [7]: a.isdisjoint(b)
Out[7]: False
In [8]: a.intersection(b)
Out[8]: set([3])
N.B.: this syntax of creating sets requires at least Python 2.7.
Given the answer you gave to Michael's clarification, you should probably look up "Dynamic Time Warping".
I haven't used http://mlpy.sourceforge.net/ but its blurb says it provides DTW. (Might be a hammer to crack a nut; depends on your use case.)
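For reference, here is a minimal textbook DTW distance in plain Python; this is just an illustrative sketch using an absolute-difference cost, not the mlpy implementation:

def dtw_distance(s, t):
    # classic O(len(s) * len(t)) dynamic-programming DTW
    n, m = len(s), len(t)
    inf = float('inf')
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # advance in s only
                                  dp[i][j - 1],      # advance in t only
                                  dp[i - 1][j - 1])  # advance in both
    return dp[n][m]

a = [4, 7, 9]
b = [5, 8, 10]
c = [2, 3]
print(dtw_distance(a, b))   # 3.0  (a and b are close)
print(dtw_distance(a, c))   # 12.0 (a and c are further apart)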