I was trying to make a pure-Python (no external dependencies) element-wise comparison of two sequences. My first solution was:
list(map(operator.eq, seq1, seq2))
Then I found the starmap function from itertools, which seemed pretty similar to me. But it turned out to be 37% faster on my computer in the worst case. As this was not obvious to me, I measured the time necessary to retrieve one element from a generator (I don't know if this way is correct):
from operator import eq
from itertools import starmap
seq1 = [1,2,3]*10000
seq2 = [1,2,3]*10000
seq2[-1] = 5
gen1 = map(eq, seq1, seq2)
gen2 = starmap(eq, zip(seq1, seq2))
%timeit -n1000 -r10 next(gen1)
%timeit -n1000 -r10 next(gen2)
271 ns ± 1.26 ns per loop (mean ± std. dev. of 10 runs, 1000 loops each)
208 ns ± 1.72 ns per loop (mean ± std. dev. of 10 runs, 1000 loops each)
In retrieving elements, the second solution is 24% faster. Beyond that, they both produce the same results for list, but from somewhere we gain an extra 13% in time:
%timeit list(map(eq, seq1, seq2))
%timeit list(starmap(eq, zip(seq1, seq2)))
5.24 ms ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.34 ms ± 84.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I don't know how to dig deeper into profiling such nested code. So my questions are: why is the second generator faster in retrieving, and where do we gain the extra 13% in the list call?
EDIT:
My original intention was to perform an element-wise comparison rather than an all-style check, so the all function was replaced with list. This replacement does not affect the timing ratio.
CPython 3.6.2 on Windows 10 (64bit)
There are several factors that contribute (in conjunction) to the observed performance difference:
zip re-uses the returned tuple if it has a reference count of 1 when the next __next__ call is made.
map builds a new tuple that is passed to the "mapped function" every time a __next__ call is made. Actually, it probably won't create a new tuple from scratch, because Python maintains a free list for unused tuples; but in that case map has to find an unused tuple of the right size.
starmap checks if the next item in the iterable is of type tuple and if so it just passes it on.
Calling a C function from within C code with PyObject_Call won't create a new tuple that is passed to the callee.
So starmap with zip will use only one tuple over and over again, which is passed to operator.eq, thus reducing the function call overhead immensely. map, on the other hand, will create a new tuple (or fill a C array, from CPython 3.6 on) every time operator.eq is called. So the speed difference is really just the tuple creation overhead.
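The tuple reuse can even be observed from pure Python. Here is a minimal sketch relying on that CPython implementation detail (so the exact output may vary by implementation): map drops its reference to each tuple right after calling id on it, so zip sees a reference count of 1 and recycles the same tuple.
# CPython implementation detail: with no surviving reference to each
# tuple, zip recycles it, so the same id shows up repeatedly.
ids = list(map(id, zip([1, 2], [3, 4], [5, 6])))
print(ids)  # typically the same id repeated on CPython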
Instead of linking to the source code I'll provide some Cython code that can be used to verify this:
In [1]: %load_ext cython
In [2]: %%cython
   ...:
   ...: from cpython.ref cimport Py_DECREF
   ...:
   ...: cpdef func(zipper):
   ...:     a = next(zipper)
   ...:     print('a', a)
   ...:     Py_DECREF(a)
   ...:     b = next(zipper)
   ...:     print('a', a)
In [3]: func(zip([1, 2], [1, 2]))
a (1, 1)
a (2, 2)
Yes, tuples aren't really immutable here: a simple Py_DECREF was sufficient to "trick" zip into believing no one else holds a reference to the returned tuple!
As for the "tuple-pass-thru":
In [4]: %%cython
   ...:
   ...: def func_inner(*args):
   ...:     print(id(args))
   ...:
   ...: def func(*args):
   ...:     print(id(args))
   ...:     func_inner(*args)
In [5]: func(1, 2)
1404350461320
1404350461320
So the tuple is passed right through (but only because these are defined as C functions!). This doesn't happen for pure Python functions:
In [6]: def func_inner(*args):
   ...:     print(id(args))
   ...:
   ...: def func(*args):
   ...:     print(id(args))
   ...:     func_inner(*args)
   ...:
In [7]: func(1, 2)
1404350436488
1404352833800
Note that it also doesn't happen if the called function isn't a C function even if called from a C function:
In [8]: %%cython
   ...:
   ...: def func_inner_c(*args):
   ...:     print(id(args))
   ...:
   ...: def func(inner, *args):
   ...:     print(id(args))
   ...:     inner(*args)
   ...:
In [9]: def func_inner_py(*args):
   ...:     print(id(args))
   ...:
In [10]: func(func_inner_py, 1, 2)
1404350471944
1404353010184
In [11]: func(func_inner_c, 1, 2)
1404344354824
1404344354824
So there are a lot of "coincidences" leading up to the point that starmap with zip is faster than calling map with multiple arguments when the called function is also a C function...
One difference I notice is how map retrieves items from the iterables. Both map and zip create a tuple of iterators from the iterables passed. Now zip maintains a result tuple internally that is repopulated every time next is called, whereas map creates a new array* with each next call and deallocates it.
*As pointed out by MSeifert, up to 3.5.4 map_next used to allocate a new Python tuple every time. This changed in 3.6: for up to 5 iterables the C stack is used, and for anything larger the heap is used. Related PRs: Issue #27809: map_next() uses fast call and Add _PY_FASTCALL_SMALL_STACK constant | Issue: https://bugs.python.org/issue27809
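For readers without IPython's %timeit, a plain-Python version of the original benchmark can reproduce the comparison on any interpreter version (numbers will differ from the ones quoted above):
import timeit
from operator import eq
from itertools import starmap

seq1 = [1, 2, 3] * 10000
seq2 = [1, 2, 3] * 10000
seq2[-1] = 5

# Time the full list construction for both variants.
print(timeit.timeit(lambda: list(map(eq, seq1, seq2)), number=100))
print(timeit.timeit(lambda: list(starmap(eq, zip(seq1, seq2))), number=100))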
Related
I have to make a program that multiplies two arrays element-wise, like this:
The first number of the first list of the first array is multiplied by the first number of the first list of the second array, and so on for every element. For example:
Input
array1 = [[1,2,3], [3,2,1]]
array2 = [[4,2,5], [5,6,7]]
So my output must be:
result = [[4,4,15],[15,12,7]]
So far my code is the following:
def multiplyArrays(array1, array2):
    if verifySameSize(array1, array2):
        for i in array1:
            for j in i:
                digitA1 = j
                for x in array2:
                    for a in x:
                        digitA2 = a
                        mult = digitA1 * digitA2
        return mult
    return 'Arrays must be the same size'
It's safe to say it's not working, since the result I'm getting for the example I gave is 7, not even an array. So, what am I doing wrong?
If you want a simple solution, use numpy:
import numpy as np
array1 = np.array([[1,2,3], [3,2,1]])
array2 = np.array([[4,2,5], [5,6,7]])
result = array1 * array2
If you want a general solution for your own understanding, then it becomes a bit harder: how in-depth do you want the implementation to be? There are many checks needed, for example for equal sizes, same types, number of dimensions, etc.
The problem in your code is using for-each loops instead of indexing. for i in array1 runs twice, returning a list each time (first [1,2,3], then [3,2,1]). Then you do a for-each loop over each list, returning a number, meaning you only ever get one number as the output: the result of the last operation (1 * 7 = 7). You should create an empty list and append your results inside normal index-based for loops (not for-each).
So your function becomes:
def multiplyArrays(array1, array2):
    result = []
    for i in range(len(array1)):
        result.append([])
        for j in range(len(array1[i])):
            result[i].append(array1[i][j] * array2[i][j])
    return result
This is a bad idea, though, because it only works with 2D arrays and there are no checks; a dimension-agnostic sketch follows below. Avoid writing your own functions unless you absolutely need to.
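For illustration only, a hedged sketch of a dimension-agnostic variant: recurse until the elements are plain numbers. It still assumes both inputs have identical nesting, so it is not a substitute for proper validation.
def multiply_nested(a, b):
    # Recurse through nested lists; multiply once we reach numbers.
    if isinstance(a, list):
        return [multiply_nested(x, y) for x, y in zip(a, b)]
    return a * b

print(multiply_nested([[1, 2, 3], [3, 2, 1]], [[4, 2, 5], [5, 6, 7]]))
# [[4, 4, 15], [15, 12, 7]]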
You can use zip() to iterate over the lists at the same time:
array1 = [[1,2,3], [3,2,1]]
array2 = [[4,2,5], [5,6,7]]
def multiplyArrays(array1, array2):
    result = []
    for inner1, inner2 in zip(array1, array2):
        inner = []
        for item1, item2 in zip(inner1, inner2):
            inner.append(item1 * item2)
        result.append(inner)
    return result
print(multiplyArrays(array1,array2))
Output as requested.
Here are three pure-Python one-liners that yield your expected output, two of which are simply list comprehension versions of the other two answers. List comprehension equivalents are generally more efficient, but you should choose what is most readable for you.
Method 1
@quamrana's, as a list comprehension.
res = [[a * b for a, b in zip(c, d)] for c, d in zip(arr1, arr2)]
Method 2 @OM222O's, as a list comprehension.
res = [[ arr1[i][j] * arr2[i][j] for j in range(len(arr1[0])) ] for i in range(len(arr1))]
Method 3 Similar to Method 1 but makes use of operator.mul(a, b) (returns a * b) from the operator module and the built-in map(function, iterable, ...) function. The map function "[r]eturn[s] an iterator that applies function to every item of iterable, yielding the results." So given two lists a (from array1) and b (from array2), map(operator.mul, a, b) returns an iterator that yields the results of multiplying each element in a with the element in b with the same index. list() converts the results into a list.
res = [list(map(operator.mul, a, b)) for a, b in zip(arr1, arr2)]
Simple Benchmark
Input
from random import randint
arr1 = [[randint(1, 25) for i in range(1_000)] for j in range(1_000)]
arr2 = [[randint(1, 25) for i in range(1_000)] for j in range(1_000)]
Ordered from fastest to slowest
# Method 3
29.2 ms ± 59.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Method 1
44.4 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Method 2
79.3 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# numpy multiplication (inclusive of time required to convert list to array)
81.7 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
We can see that Method 3 (the operator.mul approach) appears fastest and the numpy approach appears the slowest. There is a big caveat, of course, as the numpy timings included the time required to convert the lists to arrays. In order to make meaningful comparisons, we need to specify whether the input and/or output is a list and/or an array. Clearly, if the inputs are already lists and the results must also be lists, then we can be happy with standard Python approaches.
However, if arr1 and arr2 are already numpy arrays, element-wise multiplication is incredibly fast:
1.47 ms ± 5.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
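For reference, that comparison can be reproduced along these lines (a sketch; arr1 and arr2 are the input lists from the benchmark setup above, and np_arr1/np_arr2 are illustrative names for the pre-converted arrays):
import numpy as np

np_arr1 = np.array(arr1)  # conversion done once, outside the timing
np_arr2 = np.array(arr2)

%timeit np.array(arr1) * np.array(arr2)  # conversion included in the timing
%timeit np_arr1 * np_arr2                # arrays already available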
A simpler approach, without using any module:
array1 = [[1, 2, 3], [3, 2, 1]]
array2 = [[4, 2, 5], [5, 6, 7]]

result = []
i = 0
while i < len(array1):
    sub_array1 = array1[i]
    sub_array2 = array2[i]
    a, b, c = sub_array1
    d, e, f = sub_array2
    inner_list = [a * d, b * e, c * f]
    result.append(inner_list)
    i += 1
print(result)
Output:
[[4,4,15],[15,12,7]]
What could be a reason for performance degradation in the following numba compiled function for logic comparison:
from numba import njit
t = (True, 'and_', False)
# @njit(boolean(boolean, unicode_type, boolean))
@njit
def f(a, b, c):
    if b == 'and_':
        out = a & c
    elif b == 'or_':
        out = a | c
    return out
x = f(*t)
%timeit f(*t)
#1.78 µs ± 9.52 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit f.py_func(*t)
#108 ns ± 0.0042 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
To test this at scale as suggested in the answer:
import numpy as np

x = np.random.choice([True, False], 1000000)
y = np.random.choice(["and_", "or_"], 1000000)
z = np.random.choice([False, True], 1000000)
#using jit compiled f
def f2(x, y, z):
    L = x.shape[0]
    out = np.empty(L)
    for i in range(L):
        out[i] = f(x[i], y[i], z[i])
    return out
%timeit f2(x,y,z)
#2.79 s ± 86.4 ms per loop
#using pure Python f
def f3(x, y, z):
    L = x.shape[0]
    out = np.empty(L)
    for i in range(L):
        out[i] = f.py_func(x[i], y[i], z[i])
    return out
%timeit f3(x,y,z)
#572 ms ± 24.3 ms per loop
Am I missing something, and is there a way to compile a "fast" version? This is going to be part of a loop executed ~1e6 times.
You are working at too small a granularity. Numba is not designed for that. Almost all the execution time you see comes from the overhead of wrapping/unwrapping parameters, type checks, Python function wrapping, reference counting, etc. Moreover, the benefit of using Numba is very small here, since Numba barely optimizes unicode string operations.
One way to check this hypothesis is to just execute the following trivial function:
@njit
def f(a, b, c):
    return a

x = f(True, 'and_', False)
%timeit f(True, 'and_', False)
Both the trivial function and the original version take 1.34 µs per call on my machine.
Additionally, you can disassemble the Numba function to see how many instructions are executed to perform just one call and understand in depth where the overheads come from.
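Numba ships introspection helpers for this; for instance, inspect_asm() returns the generated assembly per compiled signature, which gives a quick sense of how much code a single call goes through:
# f must have been called at least once so a signature is compiled.
for sig, asm in f.inspect_asm().items():
    print(sig, "->", len(asm.splitlines()), "lines of assembly")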
If you want Numba to be useful, you need to put more work into the compiled function, ideally by working directly on arrays/lists; a sketch of that approach follows below. If this is not possible because of the dynamic nature of the input type, then Numba may not be the right tool here. You could try to rework your code a bit and use PyPy instead. Writing a native C/C++ module may help a bit, but most of the time will be spent manipulating dynamic objects and unicode strings, as well as doing type introspection, unless you rewrite the whole code.
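As a hedged sketch of that advice (not the answerer's code): move the loop into the compiled function and encode the operator as a small integer, so the Python/Numba boundary is crossed once per array rather than once per element and no unicode handling happens in the hot loop. The 0/1 operator encoding is a hypothetical convention.
import numpy as np
from numba import njit

@njit
def f_bulk(op_codes, a, c):
    # op_codes: 0 -> 'and_', 1 -> 'or_' (hypothetical encoding)
    out = np.empty(a.shape[0], dtype=np.bool_)
    for i in range(a.shape[0]):
        if op_codes[i] == 0:
            out[i] = a[i] & c[i]
        else:
            out[i] = a[i] | c[i]
    return out

x = np.random.choice([True, False], 1_000_000)
y = np.random.randint(0, 2, 1_000_000)  # operator codes instead of strings
z = np.random.choice([True, False], 1_000_000)
out = f_bulk(y, x, z)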
UPDATE
The above overhead is only paid when transitioning from Python types to Numba (and the other way around). You can see that with the following benchmark:
@njit
def f(a, b, c):
    if b == 'and_':
        out = a & c
    elif b == 'or_':
        out = a | c
    return out

@njit
def manyCalls(a, b, c):
    res = True
    for i in range(1_000_000):
        res ^= f(a, b, c ^ res)
    return res
t = (True, 'and_', False)
x = manyCalls(*t)
%timeit manyCalls(*t)
Calling manyCalls takes 3.62 ms on my machine. This means each call to f takes 3.6 ns on average (16 cycles). Thus the overhead is paid only once (when manyCalls is called).
I have an array of objects. I also have a function that requires information from 2 of the objects at a time. I would like to vectorize the call to the function so that it calculates all calls at once, rather than using a loop to go through the necessary pair of objects.
I have gotten this to work if I instead create an array with the necessary data. However this partially defeats the purpose of using objects.
Here is the code. It currently works using the array method and only one line needs to be commented/uncommented in the function to switch to the "object" mode that does not work, but I dearly wish would.
The error I get is: TypeError: only integer arrays with one element can be converted to an index
import numpy as np
import time as time
class ExampleObject():
    def __init__(self, r):
        self.r = r

def ExampleFunction(x):
    """ WHAT I REALLY WANT """
    # answer = exampleList[x].r - exampleList[indexArray].r
    """ WHAT I AM STUCK WITH """
    answer = coords[x] - exampleList[indexArray].r
    return answer

indexArray = 5  # arbitrary choice of array index
sizeArray = 1000

exampleList = []
for i in range(sizeArray):
    r = np.random.rand()
    exampleList.append(ExampleObject(r))

index_list = np.arange(0, sizeArray, 1)
index_list = np.delete(index_list, indexArray)

coords = np.array([h.r for h in exampleList])

answerArray = ExampleFunction(index_list)
The issue is that when I pass the function an array of integers, it doesn't return an array of answers (the vectorization I want) when I use the list of objects. It does work if I use an array (no objects, just data in each element). But as I have said, this defeats, in my mind, the purpose of storing information in objects to begin with. Do I really need to ALSO store the same information in arrays?
I can't comment, sorry for misusing the answer section...
If the data type of a numpy array is Python object, the memory of the numpy array is not contiguous. Vectorization of the operation may not improve performance much, if at all. Perhaps you might want to try a numpy structured array instead.
Assume the object has attributes a and b which are double-precision floating point numbers; then...
import numpy as np
numberOfObjects = 6
myStructuredArray = np.zeros(
    (numberOfObjects,),
    [("a", "f8"), ("b", "f8")],
)
you can initialize individual attributes for say object 0 like this
myStructuredArray["a"][0] = 1.0
or you can initialize individual attributes for all objects like this
myStructuredArray["a"] = [1,2,3,4,5,6]
print(myStructuredArray)
[(1., 0.) (2., 0.) (3., 0.) (4., 0.) (5., 0.) (6., 0.)]
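Applied to the question's setup, a hedged sketch: with the r values in a structured-array field, ExampleFunction's difference becomes a single vectorized expression with no Python objects involved (the "objects" name is illustrative).
import numpy as np

sizeArray = 1000
indexArray = 5

objects = np.zeros((sizeArray,), [("r", "f8")])
objects["r"] = np.random.rand(sizeArray)  # one contiguous column of r values

index_list = np.delete(np.arange(sizeArray), indexArray)
answerArray = objects["r"][index_list] - objects["r"][indexArray]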
A numpy ufunc, when given an object dtype array, iterates through the array and tries to apply a corresponding method to each element.
For example, np.abs tries to apply the __abs__ method. Let's add such a method to your class:
In [31]: class ExampleObject():
    ...:
    ...:     def __init__(self, r):
    ...:         self.r = r
    ...:     def __abs__(self):
    ...:         return self.r
    ...:
Now create your arrays:
In [32]: indexArray = 5  # arbitrary choice of array index
    ...: sizeArray = 10
    ...:
    ...: exampleList = []
    ...: for i in range(sizeArray):
    ...:     r = np.random.rand()
    ...:     exampleList.append(ExampleObject(r))
    ...:
    ...: index_list = np.arange(0, sizeArray, 1)
    ...: index_list = np.delete(index_list, indexArray)
    ...:
    ...: coords = np.array([h.r for h in exampleList])
and make an object dtype array from the list:
In [33]: exampleArr = np.array(exampleList)
In [34]: exampleArr
Out[34]:
array([<__main__.ExampleObject object at 0x7fbb541eb9b0>,
<__main__.ExampleObject object at 0x7fbb541eba90>,
<__main__.ExampleObject object at 0x7fbb541eb3c8>,
<__main__.ExampleObject object at 0x7fbb541eb978>,
<__main__.ExampleObject object at 0x7fbb541eb208>,
<__main__.ExampleObject object at 0x7fbb541eb128>,
<__main__.ExampleObject object at 0x7fbb541eb198>,
<__main__.ExampleObject object at 0x7fbb541eb358>,
<__main__.ExampleObject object at 0x7fbb541eb4e0>,
<__main__.ExampleObject object at 0x7fbb541eb048>], dtype=object)
Now we can get an array of the r values by calling the np.abs function:
In [35]: np.abs(exampleArr)
Out[35]:
array([0.28411876298913485, 0.5807617042932764, 0.30566195995294954,
0.39564156171554554, 0.28951905026871105, 0.5500945908978057,
0.40908712567465855, 0.6469497088949425, 0.7480045751535003,
0.710425181488751], dtype=object)
It also works with indexed elements of the array:
In [36]: np.abs(exampleArr[:3])
Out[36]:
array([0.28411876298913485, 0.5807617042932764, 0.30566195995294954],
dtype=object)
This is convenient, but I can't promise speed. In other tests I found that iteration over object dtypes is faster than iteration (in Python) over numeric array elements, but slower than list iteration.
In [37]: timeit np.abs(exampleArr)
3.61 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [38]: timeit [h.r for h in exampleList]
985 ns ± 31.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [39]: timeit np.array([h.r for h in exampleList])
3.55 µs ± 88.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
First, sorry for my imperfect English.
My problem is simple to explain, I think.
result = {}
list_tuple = [(float, float, float), (float, float, float), (float, float, float)...]  # 200k tuples
threshold = [float, float, float...]  # max 1k values

for tuple in list_tuple:
    for value in threshold:
        if max(tuple) > value and min(tuple) < value:
            if value in result:
                result[value].append(tuple)
            else:
                result[value] = []
                result[value].append(tuple)
list_tuple contains around 200k tuples; I have to do this operation very fast (2-3 seconds max on a normal PC).
My first attempt was to do this in Cython with prange() (so I could benefit from the Cython optimization and from the parallel execution), but the problem is (as always) the GIL: in prange() I can manage lists and tuples using Cython memoryviews, but I can't insert my results in a dict.
In Cython I also tried using the unordered_map of the C++ STL, but now the problem is that I can't make a vector of arrays in C++ (that would be the value of my dict).
The second problem is similar:
list_tuple = [((float, float), (float, float)), ((float, float), (float, float))...]  # 200k tuples of tuples

result = {list_tuple[0][0]: []}
for tuple in list_tuple:
    if tuple[0] in result:
        result[tuple[0]].append(tuple)
    else:
        result[tuple[0]] = []
Here I also have another problem: if I want to use prange() I have to use a custom hash function in order to use an array as the key of a C++ unordered_map.
As you can see, my snippets are very simple to run in parallel.
I thought of trying numba, but it would probably be the same because of the GIL, and I prefer to use Cython because I need binary code (this library could be part of a commercial software, so only binary libraries are allowed).
In general I would like to avoid C/C++ functions; what I hope to find is a way to manage something like dicts/lists in parallel, with Cython performance, while remaining as much as possible in the Python domain. But I'm open to any advice.
Thanks
Several performance improvements can be achieved, also by using numpy's vectorization features:
The min and max values are currently computed anew for each threshold. Instead they can be precomputed and then reused for each threshold.
The loop over data samples (list_tuple) is performed in pure Python. This loop can be vectorized using numpy.
In the following tests I used data.shape == (200000, 3); thresh.shape == (1000,) as indicated in the OP. I also omitted modifications to the result dict since depending on the data this can quickly overflow memory.
Applying 1.
v_min = [min(t) for t in data]
v_max = [max(t) for t in data]

for mi, ma in zip(v_min, v_max):
    for value in thresh:
        if ma > value and mi < value:
            pass
This yields a performance increase of ~5x compared to the OP's code.
Applying 1. & 2.
v_min = data.min(axis=1)
v_max = data.max(axis=1)
mask = np.empty(shape=(data.shape[0],), dtype=bool)
for t in thresh:
    mask[:] = (v_min < t) & (v_max > t)
    samples = data[mask]
    if samples.size > 0:
        pass
This yields a performance increase of ~30x compared to the OP's code. This approach has the additional benefit that it doesn't perform incremental appends to lists, which can slow down the program since memory reallocation might be required; instead it creates each list (per threshold) in a single step. A sketch that also fills the result dict follows below.
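If the result dict is actually needed, a hedged completion of the loop above (not part of the original measurement, and memory permitting) collects the matching rows per threshold:
# Builds on v_min, v_max, mask, data and thresh from the snippet above.
result = {}
for t in thresh:
    mask[:] = (v_min < t) & (v_max > t)
    result[t] = data[mask].tolist()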
@a_guest's code:
def foo1(data, thresh):
    data = np.asarray(data)
    thresh = np.asarray(thresh)
    condition = (
        (data.min(axis=1)[:, None] < thresh)
        & (data.max(axis=1)[:, None] > thresh)
    )
    result = {v: data[c].tolist() for c, v in zip(condition.T, thresh)}
    return result
This code creates a dictionary entry once for each item in thresh.
The OP's code, simplified a bit with defaultdict (from collections):
from collections import defaultdict

def foo3(list_tuple, threshold):
    result = defaultdict(list)
    for tuple in list_tuple:
        for value in threshold:
            if max(tuple) > value and min(tuple) < value:
                result[value].append(tuple)
    return result
This one updates a dictionary entry once for each item that meets the criteria.
And with his sample data:
In [27]: foo1(data,thresh)
Out[27]: {0: [], 1: [[0, 1, 2]], 2: [], 3: [], 4: [[3, 4, 5]]}
In [28]: foo3(data.tolist(), thresh.tolist())
Out[28]: defaultdict(list, {1: [[0, 1, 2]], 4: [[3, 4, 5]]})
time tests:
In [29]: timeit foo1(data,thresh)
66.1 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# In [30]: timeit foo3(data,thresh)
# 161 µs ± 242 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [31]: timeit foo3(data.tolist(),thresh.tolist())
30.8 µs ± 56.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Iteration on arrays is slower than with lists. Time for tolist() is minimal; np.asarray for lists is longer.
With a larger data sample, the array version is faster:
In [42]: data = np.random.randint(0,50,(3000,3))
...: thresh = np.arange(50)
In [43]:
In [43]: timeit foo1(data,thresh)
16 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [44]: %%timeit x,y = data.tolist(), thresh.tolist()
...: foo3(x,y)
...:
83.6 ms ± 68.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Edit
Since this approach basically performs an outer product between data samples and threshold values, it increases the required memory significantly, which might be undesired. An improved approach can be found here. I keep this answer nevertheless for future reference, since it was referred to in this answer.
I found the performance increase as compared to the OP's code to be a factor of ~ 20.
This is an example using numpy. The data is vectorized and so are the operations. Note that the resulting dict contains empty lists, as opposed to the OP's example, and hence might require an additional cleaning step, if appropriate.
import numpy as np
# Data setup
data = np.random.uniform(size=(200000, 3))
thresh = np.random.uniform(size=1000)
# Compute tuples for thresholds.
condition = (
    (data.min(axis=1)[:, None] < thresh)
    & (data.max(axis=1)[:, None] > thresh)
)
result = {v: data[c].tolist() for c, v in zip(condition.T, thresh)}
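The additional cleaning step mentioned above could be as simple as dropping thresholds with no matching samples (a sketch building on the result dict just computed):
# Drop thresholds that matched no samples.
result = {v: rows for v, rows in result.items() if rows}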
I would like to set an entire field of a NumPy structured scalar from within a Numba compiled nopython function. The desired_fn in the code below is a simple example of what I would like to do, and working_fn is an example of how I can currently accomplish this task.
import numpy as np
import numba as nb
test_numpy_dtype = np.dtype([("blah", np.int64)])
test_numba_dtype = nb.from_dtype(test_numpy_dtype)
@nb.njit
def working_fn(thing):
    for j in range(len(thing)):
        thing[j]['blah'] += j

@nb.njit
def desired_fn(thing):
    thing['blah'] += np.arange(len(thing))

a = np.zeros(3, test_numpy_dtype)
print(a)
working_fn(a)
print(a)
desired_fn(a)
The error generated from running desired_fn(a) is:
numba.errors.InternalError: unsupported array index type const('blah') in [const('blah')]
[1] During: typing of staticsetitem at /home/sam/PycharmProjects/ChessAI/playground.py (938)
This is needed for extremely performance critical code, and will be run billions of times, so eliminating the need for these types of loops seems to be crucial.
The following works (numba 0.37):
@nb.njit
def desired_fn(thing):
    thing.blah[:] += np.arange(len(thing))
    # or
    # thing['blah'][:] += np.arange(len(thing))
If you are operating primarily on columns of your data instead of rows, you might consider using a different data container. A numpy structured array is laid out like a vector of structs rather than a struct of arrays. This means that when you want to update blah, you are moving through non-contiguous memory space as you traverse the array; a struct-of-arrays sketch follows below.
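A hedged sketch of that struct-of-arrays layout (illustrative names, not the question's code): keep one contiguous array per field, so a column update walks contiguous memory and needs no field indexing inside the jitted function.
import numpy as np
import numba as nb

@nb.njit
def desired_fn_soa(blah):
    # Operates on a plain contiguous 1-D array: no struct field access.
    blah += np.arange(len(blah))

blah_column = np.zeros(3, dtype=np.int64)  # "blah" stored as its own array
desired_fn_soa(blah_column)
print(blah_column)  # [0 1 2]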
Also, as with any code optimization, it's always worth using timeit or some other timing harness (one that removes the time required to JIT the code) to see the actual performance; a minimal harness is sketched below. You might find with numba that explicit looping, while more verbose, is actually faster than your vectorized code.
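A minimal harness along those lines (a sketch, reusing test_numpy_dtype and the fixed desired_fn defined above): call the function once so the JIT compiles it, then time only the warm calls.
import timeit
import numpy as np

a = np.zeros(3, test_numpy_dtype)
desired_fn(a)  # warm-up call triggers (and thereby excludes) JIT compilation
print(timeit.timeit(lambda: desired_fn(a), number=100_000))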
Without numba, accessing field values is no slower than accessing columns of a 2d array:
In [1]: arr2 = np.zeros((10000), dtype='i,i')
In [2]: arr2.dtype
Out[2]: dtype([('f0', '<i4'), ('f1', '<i4')])
Modifying a field:
In [4]: %%timeit x = arr2.copy()
...: x['f0'] += 1
...:
16.2 µs ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Similar time if I assign the field to a new variable:
In [5]: %%timeit x = arr2.copy()['f0']
...: x += 1
...:
15.2 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Much faster if I construct a 1d array of the same size:
In [6]: %%timeit x = np.zeros(arr2.shape, int)
...: x += 1
...:
8.01 µs ± 15.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
But similar time when accessing the column of a 2d array:
In [7]: %%timeit x = np.zeros((arr2.shape[0],2), int)
...: x[:,0] += 1
...:
17.3 µs ± 23.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)