How to turn a Numpy array into a set efficiently? - python

I used:
df['ids'] = df['ids'].values.astype(set)
to turn the lists into sets, but the output was still an array, not a set:
>>> x = np.array([[1, 2, 2.5],[12,35,12]])
>>> x.astype(set)
array([[1.0, 2.0, 2.5],
       [12.0, 35.0, 12.0]], dtype=object)
Is there an efficient way to turn a list into a set in Numpy?
EDIT 1:
My input is as big as below:
I have 3,000 records. Each has 30,000 ids: [[1,...,12,13,...,30000], [1,..,43,45,...,30000],...,[...]]

First flatten your ndarray to obtain a one-dimensional array, then apply set() to it:
set(x.flatten())
Edit: since it seems you want an array of sets (one set per row), not a set of the whole array, you can do value = [set(v) for v in x] to obtain a list of sets.
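A minimal sketch of the two options, using the array from the question (note that the set conversion goes through Python objects, so it will not be fast for very large arrays, and set display order is arbitrary):
import numpy as np

x = np.array([[1, 2, 2.5], [12, 35, 12]])

set(x.flatten())       # one set over all elements: {1.0, 2.0, 2.5, 12.0, 35.0}
[set(v) for v in x]    # one set per row: [{1.0, 2.0, 2.5}, {12.0, 35.0}]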

The current state of your question (it can change any time): how can I efficiently remove duplicate elements from a large array of large arrays?
import numpy as np
rng = np.random.default_rng()
arr = rng.random((3000, 30000))
out1 = list(map(np.unique, arr))
# or
out2 = [np.unique(subarr) for subarr in arr]
Runtimes in an IPython shell:
>>> %timeit list(map(np.unique, arr))
5.39 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit [np.unique(subarr) for subarr in arr]
5.42 s ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Update: as @hpaulj pointed out in his comment, my dummy example is biased, since floating-point random numbers will almost certainly be unique. So here's a more life-like example with integer numbers:
>>> arr = rng.integers(low=1, high=15000, size=(3000, 30000))
>>> %timeit list(map(np.unique, arr))
4.98 s ± 83.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit [np.unique(subarr) for subarr in arr]
4.95 s ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In this case the elements of the output list have varying lengths, since there are actual duplicates to remove.

A couple of earlier 'row-wise' unique questions:
vectorize numpy unique for subarrays
Numpy: Row Wise Unique elements
Count unique elements row wise in an ndarray
In a couple of these the count is more interesting than the actual unique values.
If the number of unique values per row differs, then the result cannot be a (2d) array. That's a pretty good indication that the problem cannot be fully vectorized. You need some sort of iteration over the rows.
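If only the per-row counts are needed, that part can be done without a Python-level loop by sorting each row and counting the changes between neighbours. A sketch (it assumes the data fits comfortably in memory, since it makes one sorted copy of the array):
import numpy as np

rng = np.random.default_rng()
arr = rng.integers(low=1, high=15000, size=(3000, 30000))

s = np.sort(arr, axis=1)                             # sort each row independently
counts = (np.diff(s, axis=1) != 0).sum(axis=1) + 1   # number of distinct values per row
The result is a 1-d array of length 3000 with the per-row unique counts; the unique values themselves still require per-row iteration, because the deduplicated rows have different lengths.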

Related

setting all elements of a numpy array to zero vs. creating a new array of zeros

I'm doing a bunch of linear algebra on large matrices over and over, millions of times in a while loop using numpy; at the beginning of each iteration I need the arrays to be all zeros.
Is it most efficient to reuse the existing arrays by setting all the elements to zero, i.e. array[:, :, :] = 0, or to create a new array of all zeros, i.e. array = np.zeros((a, b, c))?
I would think that setting the elements to zero is best, but I don't know.
Creating a new array seems about 1000 times faster on 10M cells
new array
%%timeit
a = np.zeros((1000,10000))
Output:
20.2 µs ± 1.56 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
filling existing array
%%timeit
a[:,:] = 0
Output:
19.4 ms ± 1.77 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
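For reference, here is a self-contained sketch of the same comparison using the timeit module, so it can be run outside an IPython cell; a.fill(0) is included as a third, in-place option. Absolute numbers will of course vary by machine.
import timeit
import numpy as np

a = np.zeros((1000, 10000))

def new_array():
    return np.zeros((1000, 10000))   # allocate a fresh array

def zero_slice():
    a[:, :] = 0                      # zero the existing array via slice assignment

def zero_fill():
    a.fill(0)                        # zero the existing array via ndarray.fill

for f in (new_array, zero_slice, zero_fill):
    print(f.__name__, timeit.timeit(f, number=100) / 100)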

Create new numpy array by duplicating each item in an array

I would like to create a new numpy array by repeating each item of another array a given number of times (n). I am currently doing this with a for loop and .extend() on a list, but this is not really efficient, especially for very large arrays.
Is there a more efficient way to do this?
def expandArray(array, n):
    new_array = []
    for item in array:
        new_array.extend([item] * n)
    new_array = np.array(new_array)
    return new_array

print(expandArray([1,2,3,4], 3))
[1,1,1,2,2,2,3,3,3,4,4,4]
I don't know exactly why, but this code runs faster than np.repeat for me (note that it concatenates whole copies of the array, so the element order is [1,2,3,4,1,2,3,4,...] rather than the [1,1,1,2,2,2,...] order asked for):
def expandArray(array, n):
    return np.concatenate([array for i in range(0, n)])
I ran this little benchmark:
arr1 = np.random.rand(100000)
%timeit expandArray(arr1, 5)
1.07 ms ± 25.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
And np.repeat gives this result:
%timeit np.repeat(arr1,5)
2.45 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can create a duplicated array with the numpy.repeat function.
array = np.array([1,2,3,4])
new_array = np.repeat(array, repeats=3, axis=None)
print(new_array)
[1 1 1 2 2 2 3 3 3 4 4 4]
new_array = np.repeat(array.reshape(-1,1).transpose(), repeats=3, axis=0).flatten()
print(new_array)
[1 2 3 4 1 2 3 4 1 2 3 4]
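If the second, interleaved ordering is actually what's wanted, np.tile gives it directly; a small side-by-side sketch:
import numpy as np

array = np.array([1, 2, 3, 4])

print(np.repeat(array, 3))   # element-wise: [1 1 1 2 2 2 3 3 3 4 4 4]
print(np.tile(array, 3))     # whole-array:  [1 2 3 4 1 2 3 4 1 2 3 4]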

Fastest way to append nonzero numpy array elements to list

I want to add all nonzero elements from a numpy array arr to a list out_list. Previous research suggests that for numpy arrays, using np.nonzero is most efficient. (My own benchmark below actually suggests it can be slightly improved using np.delete).
However, in my case I want my output to be a list, because I am combining many arrays for which I don't know the number of nonzero elements (so I can't effectively preallocate a numpy array for them). Hence, I was wondering whether there are some synergies that can be exploited to speed up the process. While my naive list comprehension approach is much slower than the pure numpy approach, I got some promising results combining list comprehension with numba.
Here's what I found so far:
import numpy as np
n = 60_000 # size of array
nz = 0.3 # fraction of zero elements
arr = (np.random.random_sample(n) - nz).clip(min=0)
# method 1
def add_to_list1(arr, out):
    out.extend(list(arr[np.nonzero(arr)]))

# method 2
def add_to_list2(arr, out):
    out.extend(list(np.delete(arr, arr == 0)))

# method 3
def add_to_list3(arr, out):
    out += [x for x in arr if x != 0]

# method 4 (not sure how to get numba to accept an empty list as argument;
# see the typed-list sketch below)
from numba import njit

@njit
def add_to_list4(arr):
    return [x for x in arr if x != 0]
out_list = []
%timeit add_to_list1(arr, out_list)
out_list = []
%timeit add_to_list2(arr, out_list)
out_list = []
%timeit add_to_list3(arr, out_list)
_ = add_to_list4(arr) # call once to compile
out_list = []
%timeit out_list.extend(add_to_list4(arr))
Yielding the following results:
2.51 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.19 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.6 ms ± 183 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.63 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not surprisingly, numba outperforms all other methods. Among the rest, method 2 (using np.delete) is the best. Am I missing any obvious alternative that exploits the fact that I am converting to a list afterwards? Can you think of anything to further speed up the process?
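An aside on the parenthetical in method 4: one way to pass a pre-existing (possibly empty) list into an njit function is numba's typed list. A sketch, assuming a reasonably recent numba version and using arr from the snippet above; it has not been benchmarked against the variants here:
from numba import njit, float64
from numba.typed import List

@njit
def add_to_list4b(arr, out):
    # append nonzero elements to a numba typed list in place
    for x in arr:
        if x != 0:
            out.append(x)

typed_out = List.empty_list(float64)   # an empty typed list of float64
add_to_list4b(arr, typed_out)
out_list = list(typed_out)             # convert back to a plain Python list if needed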
Edit 1:
Performance of .tolist():
# method 5
def add_to_list5(arr, out):
    out += arr[arr != 0].tolist()

# method 6
def add_to_list6(arr, out):
    out += np.delete(arr, arr == 0).tolist()

# method 7
def add_to_list7(arr, out):
    out += arr[arr.astype(bool)].tolist()
Timings are on par with numba:
1.62 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.65 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.78 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Edit 2:
Here's some benchmarking using Mad Physicist's suggestion to use np.concatenate to construct a numpy array instead.
# construct numpy array using np.concatenate
import time

out_list = []
t = time.perf_counter()
for i in range(100):
    out_list.append(arr[arr != 0])
result = np.concatenate(out_list)
print(f"Time elapsed: {time.perf_counter() - t:.4f}s")

# compare with best list-based method
out_list = []
t = time.perf_counter()
for i in range(100):
    out_list += arr[arr != 0].tolist()
print(f"Time elapsed: {time.perf_counter() - t:.4f}s")
Concatenating numpy arrays indeed yields another significant speed-up, although it is not directly comparable since the output is a numpy array instead of a list. So which is best will depend on the precise use case.
Time elapsed: 0.0400s
Time elapsed: 0.1430s
TLDR;
1/ using arr[arr != 0] is the fastest of all the indexing options
2/ using .tolist() instead of list(.) speeds up things by a factor 1.3 - 1.5
3/ with the gains of 1/ and 2/ combined, the speed is on par with numba
4/ if having a numpy array instead of a list is acceptable, then using np.concatenate yields another gain in speed by a factor of ~3.5 compared to the best alternative
I submit that the method of choice, if you are indeed looking for a list output, is:
def f(arr, out_list):
    out_list += arr[arr != 0].tolist()
It seems to beat all the other methods mentioned so far in the OP's question or in other responses (at the time of this writing).
If, however, you are looking for a result as a numpy array, then @MadPhysicist's version (slightly modified to use arr[arr != 0] instead of np.nonzero()) is almost 6x faster; see the end of this post.
Side note: I would avoid using %timeit out_list.extend(some_list): it keeps adding to out_list during the many loops of timeit. Example:
out_list = []
%timeit out_list.extend([1,2,3])
and now:
>>> len(out_list)
243333333 # yikes
Timings
On 60K items on my machine, I see:
out_list = []
a = %timeit -o out_list + arr[arr != 0].tolist()
b = %timeit -o out_list + arr[np.nonzero(arr)].tolist()
c = %timeit -o out_list + list(arr[np.nonzero(arr)])
Yields:
1.23 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.53 ms ± 2.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
4.29 ms ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And:
>>> c.average / a.average
3.476
>>> b.average / a.average
1.244
For a numpy array result instead
Following @MadPhysicist, you can get some extra boost by not turning the arrays into lists, but using np.concatenate() instead:
def all_nonzero(arr_iter):
    """Return the nonzero elements of all arrays as a np.array."""
    return np.concatenate([a[a != 0] for a in arr_iter])

def all_nonzero_list(arr_iter):
    """Return the nonzero elements of all arrays as a list."""
    out_list = []
    for a in arr_iter:
        out_list += a[a != 0].tolist()
    return out_list
from itertools import repeat
ta = %timeit -o all_nonzero(repeat(arr, 100))
tl = %timeit -o all_nonzero_list(repeat(arr, 100))
Yields:
39.7 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
227 ms ± 680 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
and
>>> tl.average / ta.average
5.75
Instead of extending a list by all of the elements of a new array, append the array itself. This results in far fewer and smaller reallocations. You can also pre-allocate a list of Nones up front, or even use an object array, if you have an upper bound on the number of arrays you will process.
When you're done, call np.concatenate on the list.
So instead of this:
L = []
for i in range(10):
    arr = (np.random.random_sample(n) - nz).clip(min=0)
    L.extend(arr[np.nonzero(arr)])
result = np.array(L)
Try this:
L = []
for i in range(10):
    arr = (np.random.random_sample(n) - nz).clip(min=0)
    L.append(arr[np.nonzero(arr)])
result = np.concatenate(L)
Since you're keeping arrays around, the final concatenation will be a series of buffer copies (which is fast), rather than a bunch of python-to-numpy type conversions (which won't be). The exact method you choose for deletion is of course still up to the result of your benchmark.
Also, here's another method to add to your benchmark:
def add_to_list5(arr, out):
    out.extend(list(arr[arr.astype(bool)]))
I don't expect this to be overwhelmingly fast, but it's interesting to see how masking stacks up next to indexing.

From an item's index in a masked array, calculate the index of the same item in the original sorted array

I masked a sorted 1-D numpy array using the method below (which follows a solution proposed here):
def get_from_sorted(sorted, idx):
    mask = np.zeros(sorted.shape, bool)
    mask[idx] = True
    return sorted[mask]
The python method returns the array after masking on the indexes idx. For example, if sorted=np.array([0.1,0.2,0.3,0.4,0.5]) and idx=np.array([4,0,1]), then get_from_sorted should return np.array([0.1,0.2,0.5]) (note that the order in the original array is preserved).
Question: I need to get the mapping between the indices of the items in the array after masking and those in the original list. In the example above, such a mapping is
0 -> 0
1 -> 1
2 -> 4
because 0.1, 0.2, and 0.5 are at the 0th, 1st, and 4th places in sorted.
How can I program this mapping efficiently?
Requirement on efficiency: Efficiency is key in my problem. Here, "sorted" is a 1-D array of about 1 million elements, and "idx" is a 1-D array of about 0.5 million elements (taken from an image processing application). Thus, checking the elements of the masked array one by one against the original array, even in a vectorized fashion, for example using np.where, would not perform well in my case. Ideally, there should be a relatively simple mathematical relation between the indices in the masked array and those in the original sorted array. Any idea?
I assume (from your example) that the original list is the sorted list. In which case, unless I misunderstand, you just do:
idx.sort()
and then the mapping is i -> idx[i]
Of course, if the original order of idx is important, make a copy first.
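A small sketch with the numbers from the question (sorted_arr stands in for the question's sorted, renamed to avoid shadowing the builtin):
import numpy as np

sorted_arr = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
idx = np.array([4, 0, 1])

mapping = np.sort(idx)          # [0 1 4]
masked = sorted_arr[mapping]    # [0.1 0.2 0.5], same result as get_from_sorted

# position i in the masked array came from position mapping[i] in sorted_arr
for i, j in enumerate(mapping):
    print(i, "->", j)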
The question is not clear to me; it can have several interpretations.
mask -> idx (in ascending order):
Let me try with this quite large dataset (10M values, 10% of them True):
x = np.random.choice(a=[False, True], size=(10000000,), p=[0.9, 0.1])
In this case usage of np.where is quite effective:
%timeit np.where(x)[0]
%timeit x.nonzero()[0]
%timeit np.arange(len(x))[x]
24.8 ms ± 551 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
24.5 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
52.4 ms ± 895 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
random items of sorted -> idx (in ascending order):
If you have lost any reference to the positions of the items you need to take from sorted, you're still able to find idx, provided there are no duplicate items. This is O(n log n):
x = np.random.choice(a=[False, True], size=(10000000,), p=[0.9, 0.1])
arr = np.linspace(0,1,len(x))
sub_arr = arr[x]  # input data: skipping 90% of items
%timeit np.searchsorted(arr, sub_arr)  # output data
112 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
idx (in any order) -> idx (in ascending order)
this is just simple:
x = np.arange(10000000)
np.random.shuffle(x)
idx = x[:1000000] #input data: first 1M of random idx
%timeit np.sort(idx) #output data
65.3 ms ± 316 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
If you need to know where the masked entries came from, you can use one of np.where, np.nonzero or np.flatnonzero. However, if you need to get the origins of only a subset of the indices, you can use a function I recently wrote as part of my library, haggis: haggis.npy_util.unmasked_index¹.
Given mask and the indices of some of your mask elements, you can retrieve a multi-dimensional index of the original locations with
unmasked_index(idx, mask)
If you ever need it, there is also an inverse function haggis.npy_util.masked_index that converts a location in a multidimensional input array into its index in the masked array.
¹ Disclaimer: I am the author of haggis.

Using np.apply_along_axis but only on certain indices

I have a numpy array, X, of shape (200, 200, 1500). I also have a function, func, that essentially returns the mean of an array (it does a few other things, but they are all numpy operations; you can think of it as np.mean). Now if I want to apply this function along the last axis I could just do np.apply_along_axis(func, 2, X). But I also have a truth array of shape (200, 200, 1500). I want to apply func only to places where the truth array has True, ignoring any places where the truth array is False. So going back to the np.mean example, it would take the mean for each array index along the last axis but ignore some arbitrary set of indices.
So in practice, my solution would be to convert X into a new array Y with shape (200, 200) whose elements are lists, built using the truth array, and then apply func to each list in the array. The problem is that this seems very time consuming, and I feel like there is a numpy-oriented solution for this. Is there?
If what I said with the array list is the best way, how would I go about combining X and the truth array to get Y?
Any suggestions or comments appreciated.
In [268]: X = np.random.randint(0,100,(200,200,1500))
Let's check how apply works with just np.mean:
In [269]: res = np.apply_along_axis(np.mean, 2, X)
In [270]: res.shape
Out[270]: (200, 200)
In [271]: timeit res = np.apply_along_axis(np.mean, 2, X)
1.2 s ± 36.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
An equivalent using iteration on the first two dimensions. I'm using reshape to make it easier to write; speed should be about the same with a double loop.
In [272]: res1 = np.reshape([np.mean(row) for row in X.reshape(-1,1500)],(200,200))
In [273]: np.allclose(res, res1)
Out[273]: True
In [274]: timeit res1 = np.reshape([np.mean(row) for row in X.reshape(-1,1500)],(200,200))
906 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So apply may be convenient, but it is not a speed tool.
For speed in numpy you need to maximize the use of compiled code and avoid unnecessary python-level loops.
In [275]: res2 = np.mean(X,axis=2)
In [276]: np.allclose(res2,res)
Out[276]: True
In [277]: timeit res2 = np.mean(X,axis=2)
120 ms ± 619 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
If using apply in your new case is hard, you don't lose anything by using something you do understand.
masked
In [278]: mask = np.random.randint(0,2, X.shape).astype(bool)
The [272] iteration can be adapted to work with mask:
In [279]: resM1 = np.reshape([np.mean(row[m]) for row,m in
     ...:     zip(X.reshape(-1,1500), mask.reshape(-1,1500))], X.shape[:2])
In [280]: timeit resM1 = np.reshape([np.mean(row[m]) for row,m in
     ...:     zip(X.reshape(-1,1500), mask.reshape(-1,1500))], X.shape[:2])
1.43 s ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This might have problems if row[m] is empty: np.mean([]) produces a warning and a nan value.
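If empty rows are possible, a hedged variant of [279] guards against them explicitly (using X and mask as defined above; returning np.nan for an all-False row is an arbitrary choice):
resM1 = np.reshape([np.mean(row[m]) if m.any() else np.nan
                    for row, m in zip(X.reshape(-1, 1500), mask.reshape(-1, 1500))],
                   X.shape[:2])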
Applying the mask to X before any further processing loses dimensional information.
In [282]: X[mask].shape
Out[282]: (30001416,)
apply only works with one array, so it will be awkward (though not impossible) to use it to iterate on both X and mask. A structured array with data and mask fields might do the job. But as the previous timings show, there's no speed advantage.
masked array
I don't usually expect masked arrays to offer speed, but in this case it helps:
In [285]: xM = np.ma.masked_array(X, ~mask)
In [286]: resMM = np.ma.mean(xM, axis=2)
In [287]: np.allclose(resM1, resMM)
Out[287]: True
In [288]: timeit resMM = np.ma.mean(xM, axis=2)
849 ms ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
np.nanmean
There's a set of functions that use np.nan masking:
In [289]: Xfloat = X.astype(float)
In [290]: Xfloat[~mask] = np.nan
In [291]: resflt = np.nanmean(Xfloat, axis=2)
In [292]: np.allclose(resM1, resflt)
Out[292]: True
In [293]: %%timeit
...: Xfloat = X.astype(float)
...: Xfloat[~mask] = np.nan
...: resflt = np.nanmean(Xfloat, axis=2)
...:
...:
2.17 s ± 200 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This doesn't help :(
