I have a set of say, 5 elements,
[21,103,3,10,243]
and a huge Numpy array
[4,5,1,3,5,100,876,89,78......456,64,3,21,245]
with the 5 elements appearing repeatedly in the bigger array.
I want to find all the indices where the elements of the small list appear in the larger array.
The small list will be less than 100 elements long and the large array will be about 10^7 elements long, so speed is a concern here. What is the most elegant and fastest way to do this in Python 3.x?
I have tried using np.where(), but it is dead slow. I am looking for a faster way.
You can put the 100 elements to be found into a set, which is a hash table.
Then loop through the elements of the huge array and ask whether each element is in the set.
S = set([21,103,3,10,243])
A = [4,5,1,3,5,100,876,89,78......456,64,3,21,245]
result = []
for i, x in enumerate(A):
    if x in S:
        result.append(i)
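If A is an actual NumPy array, the same membership check can also be done without a Python-level loop. A minimal vectorized sketch using np.isin and np.nonzero (the variable names follow the snippet above):

import numpy as np

S = [21, 103, 3, 10, 243]
A = np.array([4, 5, 1, 3, 5, 100, 876, 89, 78, 456, 64, 3, 21, 245])

mask = np.isin(A, S)            # boolean mask, True where A's element is in S
result = np.nonzero(mask)[0]    # indices of the True entries
print(result)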
To speed things up, you can optimize like this:
Sort the larger array
Perform binary search (on the larger array) for each number in the smaller array.
Time Complexity
Sorting with numpy.sort(kind='heapsort') has time complexity O(n*log(n)).
Binary search has complexity O(log(n)) for each element in the smaller array. Assuming there are m elements in the smaller array, the total search complexity is O(m*log(n)).
Overall, this will provide you good optimization.
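A minimal sketch of that sort-plus-binary-search idea with NumPy (np.argsort keeps track of the original positions, np.searchsorted does the binary searches; the sample values are the ones from the question):

import numpy as np

small = np.array([21, 103, 3, 10, 243])
big = np.array([4, 5, 1, 3, 5, 100, 876, 89, 78, 456, 64, 3, 21, 243, 243])

order = np.argsort(big, kind='stable')    # positions that sort the big array
sorted_big = big[order]

# Binary search for the left and right edge of each value's run
left = np.searchsorted(sorted_big, small, side='left')
right = np.searchsorted(sorted_big, small, side='right')

for val, lo, hi in zip(small, left, right):
    print(val, '->', order[lo:hi])        # original indices of val (empty if absent)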
smaller_array = [21,103,3,10,243]
bigger_array = [4,5,1,3,5,100,876,89,78,456,64,3,21,243,243]
print(bigger_array)
print(smaller_array)
for val in smaller_array:
    if val in bigger_array:
        c = bigger_array.index(val)
        while True:
            print(f'{val} is found in bigger_array at index {bigger_array.index(val, c)}')
            c = bigger_array.index(val, c) + 1
            if val not in bigger_array[c:]:
                break
smaller_array = [21,103,3,10,243]
bigger_array = [4,5,1,3,5,100,876,89,78,456,64,3,21,243,243]
print(bigger_array)
print(smaller_array)
for val in smaller_array:
    if val in bigger_array:
        c = 0
        try:
            while True:
                c = bigger_array.index(val, c)
                print(f'{val} is found in bigger_array at index {c}')
                c += 1
        except ValueError:
            pass
I want to add two numpy arrays of different sizes starting at a specific index. As I need to do this a couple of thousand times with large arrays, it needs to be efficient, and I am not sure how to do it efficiently without iterating through each cell.
a = [5,10,15]
b = [0,0,10,10,10,0,0]
res = add_arrays(b,a,2)
print(res) => [0,0,15,20,25,0,0]
naive approach:
# b is the bigger array
def add_arrays(b, a, i):
    for j in range(len(a)):
        b[i+j] += a[j]
You might assign the smaller array into a zeros array and then add. I would do it the following way:
import numpy as np
a = np.array([5,10,15])
b = np.array([0,0,10,10,10,0,0])
z = np.zeros(b.shape,dtype=int)
z[2:2+len(a)] = a # 2 is offset
res = z+b
print(res)
output
[ 0 0 15 20 25 0 0]
Disclaimer: I assume that offset + len(a) is always less than or equal to len(b).
Nothing wrong with your approach. You cannot get better asymptotic time or space complexity. If you want to reduce code lines (which is not an end in itself), you could use slice assignment and some other utils:
def add_arrays(b, a, i):
    b[i:i+len(a)] = map(sum, zip(b[i:i+len(a)], a))
But the functional overhead should make this less performant, if anything.
Some docs: map, sum, zip
It should be faster than Daweo's answer, roughly 1.5-5x (depending on the size ratio between a and b).
result = b.copy()
result[offset: offset+len(a)] += a
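For completeness, a runnable version of that snippet with the arrays from the question (offset is the start index, 2 in the example):

import numpy as np

a = np.array([5, 10, 15])
b = np.array([0, 0, 10, 10, 10, 0, 0])
offset = 2

result = b.copy()
result[offset: offset + len(a)] += a
print(result)   # [ 0  0 15 20 25  0  0]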
I'm currently studying a module called data structures and algorithms at a university. We've been tasked with writing an algorithm that finds the smallest positive integer which does not occur in a given sequence. I was able to find a solution, but is there a more efficient way?
x = [5, 6, 3, 1, 2]
def missing_integer():
    for i in range(1, 100):
        if i not in x:
            return i
print(missing_integer())
The instructions include some examples:
given x = [1, 3, 6, 4, 1, 2], the function should return 5,
given x = [1, 2, 3], the function should return 4 and
given x = [-1, -3], the function should return 1.
You did not ask for the most efficient way to solve the problem, just if there is a more efficient way than yours. The answer to that is yes.
If the missing integer is near the top of the range of the integers and the list is long, your algorithm has a run-time efficiency of O(N**2): your loop goes through all possible values, and the not in operator searches through the entire list if a match is not found. (Your code searches only up to the value 100; I assume that is just a mistake on your part and you want to handle sequences of any length.)
Here is a simple algorithm that is merely order O(N*log(N)). (Note that quicker algorithms exist--I show this one since it is simple and thus answers your question easily.) Sort the sequence (which has the order I stated) then run through it starting at the smallest value. This linear search will easily find the missing positive integer. This algorithm also has the advantage that the sequence could involve negative numbers, non-integer numbers, and repeated numbers, and the code could easily handle those. This also handles sequences of any size and with numbers of any size, though of course it runs longer for longer sequences. If a good sort routine is used, the memory usage is quite small.
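A minimal sketch of that sort-and-scan idea (the function name is mine, not from the answer):

def smallest_missing_positive(seq):
    candidate = 1
    for v in sorted(seq):           # O(N*log(N)) sort
        if v == candidate:
            candidate += 1          # candidate is present, try the next integer
        elif v > candidate:
            break                   # gap found: candidate is missing
        # v < candidate: negative, zero or duplicate value, ignore it
    return candidate

print(smallest_missing_positive([1, 3, 6, 4, 1, 2]))   # 5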
I think the O(n) algorithm goes like this: initialise an array record of length n + 2 (list in Python) to None, and iterate over the input. If the element is one of the array indexes, set the element in the record to True. Now iterate over the new list record starting from index 1. Return the first None encountered.
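A sketch of that description (the record list as described above; the function name is mine):

def first_missing_positive(nums):
    n = len(nums)
    record = [None] * (n + 2)       # valid indexes are 0 .. n+1
    for x in nums:
        if 0 <= x <= n + 1:         # only values that are array indexes matter
            record[x] = True
    for i in range(1, n + 2):
        if record[i] is None:
            return i

print(first_missing_positive([1, 2, 3]))   # 4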
The slow step in your algorithm is that line:
if i not in x:
That step takes linear time, which makes the entire algorithm O(N*N). If you first turn the list into a set, the lookup is much faster:
def missing_integer():
    sx = set(x)
    for i in range(1, len(x) + 2):   # the answer can never exceed len(x) + 1
        if i not in sx:
            return i
Lookup in a set is fast, in fact it takes constant time, and the algorithm now runs in linear time O(N).
Another solution is to create an array whose size is the maximum value, traverse the input and mark each location of the array as seen when that value occurs, and then iterate from the start of the array and report the first unmarked location as the smallest missing value. This is done in O(n) (fill the array, then find the smallest unmarked location).
Also, if there are negative values, you can shift every value by the minimum so that all values become positive, then apply the method above.
The space complexity of this method is Θ(n).
This can be done in O(n) time with a bit of maths, assuming the sequence contains each integer in its range at most once and exactly one is missing. Initialise minimum_value, maximum_value and sum_value names, then loop once through the numbers to find the minimum, the maximum and the sum of all the numbers (mn, mx, sm).
Now the sum of the integers 1..n is s(n) = n*(n+1)/2.
Therefore: missing_number = (s(mx) - s(mn-1)) - sm
All done with traversing the numbers only once!
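A worked example under that single-missing-value assumption (the numbers are mine, not from the question):

def s(n):
    return n * (n + 1) // 2       # sum of the integers 1..n

nums = [1, 2, 4, 5]               # exactly one value (3) missing from the range 1..5
mn, mx, sm = min(nums), max(nums), sum(nums)
print((s(mx) - s(mn - 1)) - sm)   # 3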
My answer using list comprehension:
def solution(A):
    max_val = max(A)
    min_val = min(A)
    if max_val < 0:
        val = 1
    elif max_val > 0:
        li = [X for X in range(min_val, max_val) if X not in A]
        if len(li) > 0:
            if min(li) < 0:
                val = 1
            else:
                val = min(li)
        if len(li) == 0:
            val = max_val + 1
    return val

L = [-1, -3]
res = solution(L)
print(res)
I was solving a Project Euler problem (in Python) where I had to calculate the sum of all prime numbers below 2 million. I came up with two solutions, but only one got me the result because the other one was too slow. This is my first solution, which was too slow:
n = 2000000
list = list(range(2, n, 1))
for x in list:
    if (x * x > n):
        break
    for d in range(x, n, x):
        if ((x != d) and (d in list)):
            list.remove(d)
            continue
result = sum(list)
print(result)
This is my second solution, which was pretty fast at calculating the sum:
n = 2000000
list = list(range(2, n, 1))
temp = list
for x in list:
    if (x * x > n):
        break
    if (temp[x - 2] == 0):
        continue
    for d in range(x, n, x):
        if (x != d):
            temp[d - 2] = 0
result = sum(list)
print(result)
What I would like to know is why the last one calculated the sum almost instantaneously while the other didn't even produce a result after several minutes?
Look at the number of loops in your code. In the first solution you effectively have four loops: for x in list, for d in range(x,n,x), the membership test (d in list), and list.remove(d), which also loops through the list to find d. In the second solution you have only two loops: for x in list: and for d in range(x,n,x):
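For reference, the second solution is essentially a Sieve of Eratosthenes; a compact sketch of the same idea with a boolean array (my code, not the code from the question) is:

def prime_sum(n):
    sieve = [True] * n              # sieve[i] is True while i is still considered prime
    sieve[0] = sieve[1] = False
    for x in range(2, int(n ** 0.5) + 1):
        if sieve[x]:
            for d in range(x * x, n, x):
                sieve[d] = False    # mark multiples of x as composite
    return sum(i for i, is_prime in enumerate(sieve) if is_prime)

print(prime_sum(2000000))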
I am attempting to design an algorithm to find the common elements between two sorted and distinct arrays. I am using one of the following two methods. Is either one better in terms of runtime and time complexity?
Method 1:
# O(n^2) ?
common = []
def intersect(array1, array2):
    dict1 = {}
    for item in array1:
        dict1.update({item: 0})
    for k, v in dict1.iteritems():
        if k in array2:
            common.append(k)
    return common
print intersect(array1=[1,2,3,5], array2 = [5,6,7,8,9])
Method 2:
# probably O(n^2)
common = []
def intersect(array1, array2):
    for item1 in array1:
        for item2 in array2:
            if (item1 == item2):
                common.append(item1)
    return common
print intersect(array1=[1,2,3,5], array2 = [5,6,7,8,9])
Let array1 have M elements and array2 N elements. The first approach has time complexity O(M lg N). The second approach has time complexity O(M*N). So, from a time complexity perspective, the first is better. Note, however, that the first approach has O(M) space complexity, which the second one does not.
BTW, there is likely an O(max(M, N)) algorithm.
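Since both arrays are sorted and distinct, such an algorithm can be a merge-style walk with two indices; a sketch (my code, not from the answer above):

def intersect_sorted(array1, array2):
    common = []
    i = j = 0
    while i < len(array1) and j < len(array2):
        if array1[i] == array2[j]:
            common.append(array1[i])
            i += 1
            j += 1
        elif array1[i] < array2[j]:
            i += 1
        else:
            j += 1
    return common

print(intersect_sorted([1, 2, 3, 5], [5, 6, 7, 8, 9]))   # [5]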
set(array1).intersection(set(array2)) will likely be the fastest solution. The intersection method is lightning fast and easy to implement. Not sure about its time complexity but you may want to take a look at its implementation.
I am creating a fast method of generating a list of primes in the range(0, limit+1). In the function I end up removing all integers in the list named removable from the list named primes. I am looking for a fast and pythonic way of removing the integers, knowing that both lists are always sorted.
I might be wrong, but I believe list.remove(n) iterates over the list comparing each element with n, meaning that the following code runs in O(n^2) time.
# removable and primes are both sorted lists of integers
for composite in removable:
    primes.remove(composite)
Based on my assumption (which could be wrong; please confirm whether or not it is correct) and the fact that both lists are always sorted, I would think that the following code runs faster, since it only loops over the lists once, for O(n) time. However, it is not at all pythonic or clean.
i = 0
j = 0
while i < len(primes) and j < len(removable):
    if primes[i] == removable[j]:
        primes = primes[:i] + primes[i+1:]
        j += 1
    else:
        i += 1
Is there perhaps a built-in function or a simpler way of doing this? And what is the fastest way?
Side notes: I have not actually timed the functions or code above. Also, it doesn't matter if the list removable is changed/destroyed in the process.
For anyone interested, the full function is below:
import math

# returns a list of primes in range(0, limit+1)
def fastPrimeList(limit):
    if limit < 2:
        return list()
    sqrtLimit = int(math.ceil(math.sqrt(limit)))
    primes = [2] + list(range(3, limit + 1, 2))
    index = 1
    while primes[index] <= sqrtLimit:
        removable = list()
        index2 = index
        while primes[index] * primes[index2] <= limit:
            composite = primes[index] * primes[index2]
            removable.append(composite)
            index2 += 1
        for composite in removable:
            primes.remove(composite)
        index += 1
    return primes
This is quite fast and clean: it does O(n) set membership checks, and it runs in O(n) amortized time (the first line is O(n) amortized, the second line is O(n * 1) amortized, because a membership check is O(1) amortized):
removable_set = set(removable)
primes = [p for p in primes if p not in removable_set]
Here is the modification of your 2nd solution. It does O(n) basic operations (worst case):
tmp = []
i = j = 0
while i < len(primes) and j < len(removable):
    if primes[i] < removable[j]:
        tmp.append(primes[i])
        i += 1
    elif primes[i] == removable[j]:
        i += 1
    else:
        j += 1
primes[:i] = tmp
del tmp
Please note that constants also matter. The Python interpreter is quite slow (i.e. with a large constant) to execute Python code. The 2nd solution has lots of Python code, and it can indeed be slower for small practical values of n than the solution with sets, because the set operations are implemented in C, thus they are fast (i.e. with a small constant).
If you have multiple working solutions, run them on typical input sizes, and measure the time. You may get surprised about their relative speed, often it is not what you would predict.
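For example, a minimal timing harness along those lines, using timeit with synthetic sorted data rather than the real prime lists, might look like this:

import random
import timeit

primes = sorted(random.sample(range(1, 10000000), 100000))
removable = sorted(random.sample(primes, 10000))

def with_set(primes, removable):
    # candidate implementation: rebuild the list, skipping removable values
    removable_set = set(removable)
    return [p for p in primes if p not in removable_set]

print(timeit.timeit(lambda: with_set(primes, removable), number=10))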
The most important thing here is to remove the quadratic behavior. You have this for two reasons.
First, calling remove searches the entire list for values to remove. Doing this takes linear time, and you're doing it once for each element in removable, so your total time is O(NM) (where N is the length of primes and M is the length of removable).
Second, removing elements from the middle of a list forces you to shift the whole rest of the list up one slot. So, each one takes linear time, and again you're doing it M times, so again it's O(NM).
How can you avoid these?
For the first, you either need to take advantage of the sorting, or just use something that allows you to do constant-time lookups instead of linear-time, like a set.
For the second, you either need to create a list of indices to delete and then do a second pass to move each element up the appropriate number of indices all at once, or just build a new list instead of trying to mutate the original in-place.
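A sketch of that second idea, compacting the list in place from a set of indices to delete (the helper name is mine, not from the answer):

def delete_indices_in_place(lst, indices_to_delete):
    # Second pass: shift the surviving elements left over the deleted slots
    delete = set(indices_to_delete)
    write = 0
    for read in range(len(lst)):
        if read not in delete:
            lst[write] = lst[read]
            write += 1
    del lst[write:]          # drop the now-unused tail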
So, there are a variety of options here. Which one is best? It almost certainly doesn't matter; changing your O(NM) time to just O(N+M) will probably be more than enough of an optimization that you're happy with the results. But if you need to squeeze out more performance, then you'll have to implement all of them and test them on realistic data.
The only one of these that I think isn't obvious is how to "use the sorting". The idea is to use the same kind of staggered-zip iteration that you'd use in a merge sort, like this:
def sorted_subtract(seq1, seq2):
    # Yield the elements of sorted seq1 that do not appear in sorted seq2
    i1, i2 = 0, 0
    while i1 < len(seq1):
        if i2 == len(seq2):
            yield from seq1[i1:]
            return
        if seq1[i1] == seq2[i2]:
            i1 += 1          # skip the matched element
            i2 += 1
        elif seq1[i1] < seq2[i2]:
            yield seq1[i1]   # not in seq2, keep it
            i1 += 1
        else:
            i2 += 1          # move seq2 forward until it catches up