Speed up search of array element in second array - python

I have a pretty simple operation involving two not so large arrays:
For every element in the first (larger) array, located in position i
Find if it exists in the second (smaller) array
If it does, find its index in the second array: j
Store a float taken from a third array (same length as first array) in the position i, in the position j of a fourth array (same length as second array)
The for block below works, but gets very slow for not so large arrays (>10000).
Can this implementation be made faster?
import numpy as np
import random
##############################################
# Generate some random data.
#'Nb' is always smaller then 'Na
Na, Nb = 50000, 40000
# List of IDs (could be any string, I use integers here for simplicity)
ids_a = random.sample(range(1, Na * 10), Na)
ids_a = [str(_) for _ in ids_a]
random.shuffle(ids_a)
# Some floats associated to these IDs
vals_in_a = np.random.uniform(0., 1., Na)
# Smaller list of repeated IDs from 'ids_a'
ids_b = random.sample(ids_a, Nb)
# Array to be filled
vals_in_b = np.zeros(Nb)
##############################################
# This block needs to be *a lot* more efficient
#
# For each string in 'ids_a'
for i, id_a in enumerate(ids_a):
# if it exists in 'ids_b'
if id_a in ids_b:
# find where in 'ids_b' this element is located
j = ids_b.index(id_a)
# store in that position the value taken from 'ids_a'
vals_in_b[j] = vals_in_a[i]

In defense of my approach, here is the authoritative implementation:
import itertools as it
def pp():
la,lb = len(ids_a),len(ids_b)
ids = np.fromiter(it.chain(ids_a,ids_b),'<S6',la+lb)
unq,inv = np.unique(ids,return_inverse=True)
vals = np.empty(la,vals_in_a.dtype)
vals[inv[:la]] = vals_in_a
return vals[inv[la:]]
(juanpa()==pp()).all()
# True
timeit(juanpa,number=100)
# 3.1373191522434354
timeit(pp,number=100)
# 2.5256317732855678
That said, #juanpa.arrivillaga's suggestion can also be implemented better:
import operator as op
def ja():
return op.itemgetter(*ids_b)(dict(zip(ids_a,vals_in_a)))
(ja()==pp()).all()
# True
timeit(ja,number=100)
# 2.015202699229121

I tried the approaches by juanpa.arrivillaga and Paul Panzer. The first one is the fastest by far. It is also the simplest. The second one is faster than my original approach, but considerably slower than the first one. It also has the drawback that this line vals[inv_a] = vals_in_a stores floats into a U5 array, thus converting them into strings. It can be converted back to floats at the end, but I lose digits (unless I'm missing something obvious of course.
Here are the implementations:
def juanpa():
dict_ids_b = {_: i for i, _ in enumerate(ids_b)}
for i, id_a in enumerate(ids_a):
try:
vals_in_b[dict_ids_b[id_a]] = vals_in_a[i]
except KeyError:
pass
return vals_in_b
def Paul():
# 1) concatenate ids_a and ids_b
ids_ab = ids_a + ids_b
# 2) apply np.unique with keyword return_inverse=True
vals, idxs = np.unique(ids_ab, return_inverse=True)
# 3) split the inverse into inv_a and inv_b
inv_a, inv_b = idxs[:len(ids_a)], idxs[len(ids_a):]
# 4) map the values to match the order of uniques: vals[inv_a] = vals_in_a
vals[inv_a] = vals_in_a
# 5) use inv_b to pick the correct values: result = vals[inv_b]
vals_in_b = vals[inv_b].astype(float)
return vals_in_b

Related

Delete 2D unique elements in a 2D NumPy array

I generate a set of unique coordinate combinations by using:
axis_1 = np.arange(image.shape[0])
axis_1 = np.reshape(axis_1,(axis_1.shape[0],1))
axis_2 = np.arange(image.shape[1])
axis_2 = np.reshape(axis_2,(axis_2.shape[0],1))
coordinates = np.array(np.meshgrid(axis_1, axis_2)).T.reshape(-1,2)
I then check for some condition and if it is satisfied i want to delete the coordinates from the array.
Something like this:
if image[coordinates[i,0], coordinates[i,1]] != 0:
remove coordinates i from coordinates
I tried the remove and delete commands but one doesn't work for arrays and the other simply just removes every instance where coordinates[i,0] and coordinates[i,1] appear, rather than the unique combination of both.
You can use np.where to generate the coordinate pairs that should be removed, and np.unique combined with masking to remove them:
y, x = np.where(image > 0.7)
yx = np.column_stack((y, x))
combo = np.vstack((coordinates, yx))
unique, counts = np.unique(combo, axis=0, return_counts=True)
clean_coords = unique[counts == 1]
The idea here is to stack the original coordinates and the coordinates-to-be-removed in the same array, then drop the ones that occur in both.
You can use the numpy.delete function, but this function returns a new modified array, and does not modify the array in-place (which would be quite problematic, specially in a for loop).
So your code would look like that:
nb_rows_deleted = 0
for i in range(0, coordinates.shape[0]):
corrected_i = i - nb_rows_deleted
if image[coordinates[corrected_i, 0], coordinates[corrected_i, 1]] != 0:
coordinates = np.delete(coordinates, corrected_i, 0)
nb_rows_deleted += 1
The corrected_i takes into consideration that some rows have been deleted during your loop.

Creating data in loop subject to moving condition

I am trying to create a list of data in a for loop then store this list in a list if it satisfies some condition. My code is
R = 10
lam = 1
proc_length = 100
L = 1
#Empty list to store lists
exponential_procs_lists = []
for procs in range(0,R):
#Draw exponential random variables
z_exponential = np.random.exponential(lam,proc_length)
#Sort values to increase
z_exponential.sort()
#Insert 0 at start of list
z_dat_r = np.insert(z_exponential,0,0)
sum = np.sum(np.diff(z_dat_r))
if sum < 5*L:
exponential_procs_lists.append(z_dat_r)
which will store some of the R lists that satisfies the sum < 5L condition. My question is, what is the best way to store R lists where the sum of each list is less than 5L? The lists can be different length but they must satisfy the condition that the sum of the increments is less than 5*L. Any help much appreciated.
Okay so based on your comment, I take that you want to generate an exponential_procs_list, inside which every sublist has a sum < 5*L.
Well, I modified your code to chop the sublists as soon as the sum exceeds 5*L.
Edit : See answer history to see my last answer for the approach above.
Well looking closer, notice you don't actually need the discrete difference array. You're finding the difference array, summing it up and checking whether the sum's < 5L and if it is, you append the original array.
But notice this:
if your array is like so: [0, 0.00760541, 0.22281415, 0.60476231], it's difference array would be [0.00760541 0.21520874 0.38194816].
If you add the first x terms of the difference array, you get the x+1th element of the original array. So you really just need to keep elements which are lesser than 5L:
import numpy as np
R = 10
lam = 1
proc_length = 5
L = 1
exponential_procs_lists = []
def chop(nums, target):
good_list = []
for num in nums:
if num >= target:
break
good_list.append(num)
return good_list
for procs in range(0,R):
z_exponential = np.random.exponential(lam,proc_length)
z_exponential.sort()
z_dat_r = np.insert(z_exponential,0,0)
good_list = chop(z_dat_r, 5*L)
exponential_procs_lists.append(good_list)
You could probably also just do a binary search(for better time complexity) or use a filter lambda, that's up to you.

Python set() and list() functions together

Looking through some python code I encountered this line:
x = list(set(range(height)) - set(array))
where array is just an int array. It's guaranteed that array's len is less than height.
Could someone please explain me how does it work?
Thanks!
Let's take sample values and see what's happening here
height = 5 # height has to be an integer for range()
array = [1,1,2,3]
x = list(set(range(height)) - set(array))
print(x) # [0,4]
Let's break it down into smaller pieces of code
height = 5
array = [1,1,2,3]
a = range(height) # Generates a list [0,1,2,3,4]
a_set = set(a) # Converts a into a set (0,1,2,3,4)
b_set = set(array) # Converts array into a set (1,2,3)
x_set = a_set - b_set # Does set operation A-B, ie, removes elements of B from A. (0,4)
x = list(x_set) # Converts it into a list [0,4]
print(x) # [0,4]
I an guessing height is an int.
What your script does is that it returns a list of all distinct values from 0 to 'height' , height not included, that do not appear in your array.
The range of a number returns 0 up to that number.
set() of a collection removes all duplicate elements
You can subtract two sets, which is called the difference, returning all elements that are not shared between them.
The result is casted back to a list and assigned to the variable

Creating fixed length numpy array from variable length lists

I have been generating variable length lists, a simplified example.
list_1 = [5] * 5
list_2 = [8] * 10
I then want to convert to np.array for manipulation. As such they need to be the same length (e.g. 1200), with the tail either populated with zeros, or truncated at the target length.
For a fixed length of 8, I considered setting up a zero array then filling the appropriate entries:
np_list_1 = np.zeros(8)
np_list_1[0:5] = list_1[0:5]
np_list_2 = np.zeros(8)
np_list_2[0:8] = list_2[0:8] # longer lists are truncated
I've created the following function to produce these
def get_np_fixed_length(list_like, length):
list_length = len(list_like)
np_array = np.zeros(length)
if list_length <= length:
np_array[0:list_length] = list_like[:]
else:
np_array[:] = list_like[0:length]
return np_array
Is there a more efficient way to do this? (I couldn't see anything in numpy documentation)
You did the job already. You may save some lines by using min function.
np_array = np.zeros(length)
l = min(length, len(list_like))
np_array[:l] = list_like[:l]
There is a bit more elegant way, by first creating an empty array, fill it up and then fill the tail with zeros.

How do I get an empty list of any size in Python?

I basically want a Python equivalent of this Array in C:
int a[x];
but in python I declare an array like:
a = []
and the problem is I want to assign random slots with values like:
a[4] = 1
but I can't do that with Python, since the Python list is empty (of length 0).
If by "array" you actually mean a Python list, you can use
a = [0] * 10
or
a = [None] * 10
You can't do exactly what you want in Python (if I read you correctly). You need to put values in for each element of the list (or as you called it, array).
But, try this:
a = [0 for x in range(N)] # N = size of list you want
a[i] = 5 # as long as i < N, you're okay
For lists of other types, use something besides 0. None is often a good choice as well.
You can use numpy:
import numpy as np
Example from Empty Array:
np.empty([2, 2])
array([[ -9.74499359e+001, 6.69583040e-309],
[ 2.13182611e-314, 3.06959433e-309]])
also you can extend that with extend method of list.
a= []
a.extend([None]*10)
a.extend([None]*20)
Just declare the list and append each element. For ex:
a = []
a.append('first item')
a.append('second item')
If you (or other searchers of this question) were actually interested in creating a contiguous array to fill with integers, consider bytearray and memoryivew:
# cast() is available starting Python 3.3
size = 10**6
ints = memoryview(bytearray(size)).cast('i')
ints.contiguous, ints.itemsize, ints.shape
# (True, 4, (250000,))
ints[0]
# 0
ints[0] = 16
ints[0]
# 16
It is also possible to create an empty array with a certain size:
array = [[] for _ in range(n)] # n equal to your desired size
array[0].append(5) # it appends 5 to an empty list, then array[0] is [5]
if you define it as array = [] * n then if you modify one item, all are changed the same way, because of its mutability.
x=[]
for i in range(0,5):
x.append(i)
print(x[i])
If you actually want a C-style array
import array
a = array.array('i', x * [0])
a[3] = 5
try:
[5] = 'a'
except TypeError:
print('integers only allowed')
Note that there's no concept of un-initialized variable in python. A variable is a name that is bound to a value, so that value must have something. In the example above the array is initialized with zeros.
However, this is uncommon in python, unless you actually need it for low-level stuff. In most cases, you are better-off using an empty list or empty numpy array, as other answers suggest.
The (I think only) way to assign "random slots" is to use a dictionary, e.g.:
a = {} # initialize empty dictionary
a[4] = 1 # define the entry for index 4 to be equal to 1
a['French','red'] = 'rouge' # the entry for index (French,red) is "rouge".
This can be handy for "quick hacks", and the lookup overhead is irrelevant if you don't have intensive access to the array's elements.
Otherwise, it will be more efficient to work with pre-allocated (e.g., numpy) arrays of fixed size, which you can create with a = np.empty(10) (for an non-initialized vector of length 10) or a = np.zeros([5,5]) for a 5x5 matrix initialized with zeros).
Remark: in your C example, you also have to allocate the array (your int a[x];) before assigning a (not so) "random slot" (namely, integer index between 0 and x-1).
References:
The dict datatype: https://docs.python.org/3/library/stdtypes.html#mapping-types-dict
Function np.empty(): https://numpy.org/doc/stable/reference/generated/numpy.empty.html

Categories