Replace values in a large list of arrays (performance) - python

I have a performance problem with replacing values of a list of arrays using a dictionary.
Let's say this is my dictionary:
# Create a sample dictionary
keys = [1, 2, 3, 4]
values = [5, 6, 7, 8]
dictionary = dict(zip(keys, values))
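which gives:
dictionary
>{1: 5, 2: 6, 3: 7, 4: 8}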
And this is my list of arrays:
import numpy as np
# List of arrays
listvalues = []
arr1 = np.array([1, 3, 2])
arr2 = np.array([1, 1, 2, 4])
arr3 = np.array([4, 3, 2])
listvalues.append(arr1)
listvalues.append(arr2)
listvalues.append(arr3)
listvalues
>[array([1, 3, 2]), array([1, 1, 2, 4]), array([4, 3, 2])]
I then use the following function to replace all values in a nD numpy array using a dictionary:
# Replace function
def replace(arr, rep_dict):
    # sorted keys and their values as two aligned arrays
    rep_keys, rep_vals = np.array(list(zip(*sorted(rep_dict.items()))))
    # find the position of each element of arr among the sorted keys
    idces = np.digitize(arr, rep_keys, right=True)
    return rep_vals[idces]
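For example, with the dictionary above, each key in the input is mapped to its value:
replace(np.array([1, 3, 2]), dictionary)
>array([5, 7, 6])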
This function is really fast, however I need to iterate over my list of arrays to apply this function to each array:
replaced = []
for arr in listvalues:
    replaced.append(replace(arr, dictionary))
This is the bottleneck of the process, as it needs to iterate over thousands of arrays.
How could I achieve the same result without the for-loop? It is important that the result has the same format as the input (a list of arrays with replaced values).
Many thanks guys!!

This will do the trick efficiently, using the numpy_indexed package. It can be further simplified if all values in listvalues are guaranteed to be present in keys, but I'll leave that as an exercise to the reader.
import numpy_indexed as npi
arr = np.concatenate(listvalues)
idx = npi.indices(keys, arr, missing='mask')  # positions of arr's values in keys
remap = np.logical_not(idx.mask)              # True where a key was found
arr[remap] = np.array(values)[idx[remap]]     # substitute the mapped values
replaced = np.array_split(arr, np.cumsum([len(a) for a in listvalues][:-1]))
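If you prefer to avoid the extra dependency, the same concatenate-once/split-back idea also works with the digitize-based replace function from the question; a minimal sketch, assuming every value in listvalues appears among the dictionary keys:
arr = np.concatenate(listvalues)
# one vectorized replacement over the whole concatenation
flat = replace(arr, dictionary)
# split back at the original array boundaries
replaced = np.array_split(flat, np.cumsum([len(a) for a in listvalues][:-1]))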

Related

ordering an array based on values of another array

This question is probably basic for some of you, but I am new to Python. I have an initial array:
initial_array = np.array([1, 6, 3, 4])
I have another array.
value_array = np.array([10, 2, 3, 15])
I want an array called output_array that looks at the values in value_array and reorders initial_array.
My result should look like this:
output_array = np.array([4, 1, 3, 6])
Does anyone know if this is possible to do in Python?
So far I have tried:
for i in range(4):
    # find position of element
You can use numpy.argsort to get the sort indices of value_array, then rearrange initial_array with those indices in reverse order using [::-1].
>>> idx_sort = value_array.argsort()
>>> initial_array[idx_sort[::-1]]
array([4, 1, 3, 6])
You could use stack to put the arrays together (basically adding value_array as a column to initial_array), then sort by that column.
import numpy as np
initial_array = np.array([1, 6, 3, 4])
value_array = np.array([10, 2, 3, 15])
output_array = np.stack((initial_array, value_array), axis=1)
output_array = output_array[output_array[:, 1].argsort()][::-1]
print(output_array[:, 0])  # [4 1 3 6]
The [::-1] part gives descending order; remove it to get ascending. I am assuming initial_array and value_array have the same length.

Counting number of occurrences of an array in array of numpy 2D arrays

I have a numpy 2D array of arrays:
samples = np.array([[1,2,3], [2,3,4], [4,5,6], [1,2,3], [2,3,4], [2,3,4]])
I need to count how many times each array occurs inside the array above, like:
counts = [[1,2,3]:2, [2,3,4]:3, [4,5,6]:1]
I'm not sure how to count these, or how to list them out the way I have above so that each array stays connected to its count. Any help is appreciated. Thank you!
Everything you need is directly in numpy:
import numpy as np
a = np.array([[1,2,3], [2,3,4], [4,5,6], [1,2,3], [2,3,4], [2,3,4]])
print(np.unique(a, axis=0, return_counts=True))
Result:
(array([[1, 2, 3],
       [2, 3, 4],
       [4, 5, 6]]), array([2, 3, 1], dtype=int64))
The result is a tuple of an array with the unique rows, and an array with the counts of those rows.
If you need to go through them pairwise:
unique_rows, counts = np.unique(a, axis=0, return_counts=True)
for row, c in zip(unique_rows, counts):
    print(row, c)
Result:
[1 2 3] 2
[2 3 4] 3
[4 5 6] 1
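If you want the row-to-count mapping sketched in the question, you can zip the two results into a plain dict:
unique_rows, counts = np.unique(a, axis=0, return_counts=True)
counts_dict = {tuple(row): int(c) for row, c in zip(unique_rows, counts)}
# {(1, 2, 3): 2, (2, 3, 4): 3, (4, 5, 6): 1}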
Here's a method that does it without using much of the numpy library:
import numpy as np
samples = np.array([[1,2,3], [2,3,4], [4,5,6], [1,2,3], [2,3,4], [2,3,4]])
result = {}
for row in samples:
    inDictionary = False
    for check in range(len(result)):
        if np.all(result[str(check)][0] == row):
            result[str(check)][1] += 1
            inDictionary = True
    if not inDictionary:
        result[str(len(result))] = [row, 1]
print("------------------")
print(result)
This method creates a dictionary called result and then loops through the rows of samples, checking whether each one is already in the dictionary. If it is, its count is increased by 1; otherwise, a new entry is created for that array.
The saved values and counts can then be accessed with result[str(i)] for the index i you want: result[str(i)][0] is the array value and result[str(i)][1] is the number of times it appeared.
Here is a plain-Python method that is relatively fast compared with the other Python (no numpy) solutions:
from collections import Counter
>>> Counter(map(tuple, samples.tolist())) # convert to dict if you need it
Counter({(1, 2, 3): 2, (2, 3, 4): 3, (4, 5, 6): 1})
Plain Python handles this quite fast too, because operations on tuples are well optimised:
import numpy as np
from collections import Counter
import benchit
%matplotlib inline
benchit.setparams(rep=3)

sizes = [3, 10, 30, 100, 300, 900, 3000, 9000, 30000, 90000, 300000, 900000, 3000000]
arr = np.random.randint(0, 10, size=(sizes[-1], 3)).astype(int)

def count_python(samples):
    return Counter(map(tuple, samples.tolist()))

def count_numpy(samples):
    return np.unique(samples, axis=0, return_counts=True)

fns = [count_python, count_numpy]
in_ = {s: (arr[:s],) for s in sizes}
t = benchit.timings(fns, in_, multivar=True, input_name='Number of items')
t.plot(logx=True, figsize=(12, 6), fontsize=14)
Note that arr.tolist() alone takes about 0.8 s of the Python version's time on the 3M-row array.

Indexing array from second element for all elements

I think it must be easy, but I cannot google it. Suppose I have an array of the numbers 1, 2, 3, 4.
import numpy as np
a = np.array([1,2,3,4])
How do I index the array if I want the sequence 2, 3, 4, 1?
I know that for the sequence 2, 3, 4 I can use, e.g.:
print(a[1::1])
If you want to rotate the list, you can use a deque instead of a numpy array. This data structure is designed for this kind of operation and directly provides a rotate function.
>>> from collections import deque
>>> a = deque([1, 2, 3, 4])
>>> a.rotate(-1)
>>> a
deque([2, 3, 4, 1])
If you want to use Numpy, you can check out the roll function.
>>> import numpy as np
>>> a = np.array([1,2,3,4])
>>> np.roll(a, -1)
array([2, 3, 4, 1])
One possible way is to define an index set (a list):
index_set = [1, 2, 3, 0]
print(a[index_set])

need to grab entire first element (array) in a multi-dimensional list of arrays python3

Apologies if this has already been asked, but I searched quite a bit and couldn't find quite the right solution. I'm new to python, but I'll try to be as clear as possible. In short, I have a list of arrays in the following format resulting from joining a multiprocessing pool:
array = [[[1,2,3], 5, 47, 2515], ..... [[4,5,6], 3, 35, 2096]]
and I want to gather all values from the first element of each entry into a new array of the following form:
print(new_array)
[1,2,3,4,5,6]
In my code, I was trying to get the first value with this indexing:
new_array = array[0][0]
but this only returns the first of those arrays:
print(new_array)
[1,2,3]
I also tried np.take after converting the array into a np array:
array = np.array(array)
new_array = np.take(array, 0)
print(new_array)
[1,2,3]
I have tried a number of np functions (concatenate, take, etc.) to iterate this over the list, but I get the following error (presumably because the array sizes change):
ValueError: autodetected range of [[], [1445.0, 1445.0, -248.0, 638.0, -108.0, 649.0]] is not finite
Thanks for any help!
You can achieve it without numpy using reduce:
from functools import reduce
l = [[[1,2,3], 5, 47, 2515], [[4,5,6], 3, 35, 2096]]
res = reduce(lambda a, b: [*a, *b], [x[0] for x in l])
Output
[1, 2, 3, 4, 5, 6]
Maybe it is worth mentioning that [*a, *b] is a way to concatenate lists in python, for example:
[*[1, 2, 3], *[4, 5, 6]] # [1, 2, 3, 4, 5, 6]
You could also use itertools' chain() function to flatten the first sub-array extracted from each element of the list:
from itertools import chain
result = list(chain(*[sub[0] for sub in array]))
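If you'd rather end up with a NumPy array directly, the same extraction works with np.concatenate; a minimal sketch, assuming each entry's first element is a flat list as above:
import numpy as np
array = [[[1, 2, 3], 5, 47, 2515], [[4, 5, 6], 3, 35, 2096]]
# join the first element of every entry into one flat array
new_array = np.concatenate([sub[0] for sub in array])
print(new_array)  # [1 2 3 4 5 6]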

Get list of indices matching condition with NumPy [duplicate]

Is there any way to get the indices of several elements in a NumPy array at once?
E.g.
import numpy as np
a = np.array([1, 2, 4])
b = np.array([1, 2, 3, 10, 4])
I would like to find the index of each element of a in b, namely: [0,1,4].
I find the solution I am using a bit verbose:
import numpy as np
a = np.array([1, 2, 4])
b = np.array([1, 2, 3, 10, 4])
c = np.zeros_like(a)
for i, aa in np.ndenumerate(a):
    c[i] = np.where(b == aa)[0]
print('c: {0}'.format(c))
Output:
c: [0 1 4]
You could use in1d and nonzero (or where for that matter):
>>> np.in1d(b, a).nonzero()[0]
array([0, 1, 4])
This works fine for your example arrays, but in general the array of returned indices does not honour the order of the values in a. This may be a problem depending on what you want to do next.
In that case, a much better answer is the one @Jaime gives here, using searchsorted:
>>> sorter = np.argsort(b)
>>> sorter[np.searchsorted(b, a, sorter=sorter)]
array([0, 1, 4])
This returns the indices for values as they appear in a. For instance:
a = np.array([1, 2, 4])
b = np.array([4, 2, 3, 1])
>>> sorter = np.argsort(b)
>>> sorter[np.searchsorted(b, a, sorter=sorter)]
array([3, 1, 0]) # the other method would return [0, 1, 3]
This is a simple one-liner using the numpy-indexed package (disclaimer: I am its author):
import numpy_indexed as npi
idx = npi.indices(b, a)
The implementation is fully vectorized, and it gives you control over the handling of missing values. Moreover, it works for nd-arrays as well (for instance, finding the indices of rows of a in b).
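For instance, a minimal sketch of the nd-array case, matching whole rows (hypothetical example data; assumes the numpy_indexed package is installed):
import numpy as np
import numpy_indexed as npi
A = np.array([[1, 2], [3, 4]])
B = np.array([[3, 4], [5, 6], [1, 2]])
print(npi.indices(B, A))  # [2 0]: positions of A's rows within B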
All of the solutions here recommend using a linear search. You can use np.argsort and np.searchsorted to speed things up dramatically for large arrays:
sorter = b.argsort()  # indices that would sort b
i = sorter[np.searchsorted(b, a, sorter=sorter)]  # positions of a's values in b
For an order-agnostic solution, you can use np.flatnonzero with np.isin (v 1.13+).
import numpy as np
a = np.array([1, 2, 4])
b = np.array([1, 2, 3, 10, 4])
res = np.flatnonzero(np.isin(b, a))  # NumPy v1.13+
res = np.flatnonzero(np.in1d(b, a))  # earlier versions
# array([0, 1, 4], dtype=int64)
There are a bunch of approaches for getting the index of multiple items at once mentioned in passing in answers to this related question: Is there a NumPy function to return the first index of something in an array?. The wide variety and creativity of the answers suggests there is no single best practice, so if your code above works and is easy to understand, I'd say keep it.
I personally found this approach to be both performant and easy to read: https://stackoverflow.com/a/23994923/3823857
Adapting it for your example:
import numpy as np
a = np.array([1, 2, 4])
b_list = [1, 2, 3, 10, 4]
b_array = np.array(b_list)
indices = [b_list.index(x) for x in a]
vals_at_indices = b_array[indices]
I personally like adding a little bit of error handling in case a value in a does not exist in b.
import numpy as np
a = np.array([1, 2, 4])
b_list = [1, 2, 3, 10, 4]
b_array = np.array(b_list)
b_set = set(b_list)
indices = [b_list.index(x) if x in b_set else None for x in a]  # None marks missing values
vals_at_indices = b_array[[i for i in indices if i is not None]]
For my use case, it's pretty fast, since it relies on parts of Python that are fast (list comprehensions, .index(), sets, numpy indexing). Would still love to see something that's a NumPy equivalent to VLOOKUP, or even a Pandas merge. But this seems to work for now.
