I'm trying to optimize my code a bit. A single call is pretty fast, but since it is executed very often, it adds up to a noticeable cost.
My input data looks like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.random.randn(30),
                  index=pd.date_range("2016-01-01", periods=30))
df.iloc[:20] = np.nan
Now I just want to apply a simple function. Here is the part I want to optimize:
s = df >= df.shift(1)
s = s.applymap(lambda x: 1 if x else 0)
Right now I'm getting 1000 loops, best of 3: 1.36 ms per loop. I guess it should be possible to do this much faster. I'm not sure whether I should vectorize, work only with numpy, or maybe use Cython. Any idea what the best approach would be? I struggle a bit with the shift operation.
You can cast the result of your comparison directly from bool to int:
(df >= df.shift(1)).astype(int)
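A quick way to convince yourself that this matches the applymap version (a minimal sketch, reusing the setup from the question): rows involving NaN compare as False and therefore become 0 in both cases.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(30),
                  index=pd.date_range("2016-01-01", periods=30))
df.iloc[:20] = np.nan

fast = (df >= df.shift(1)).astype(int)
slow = (df >= df.shift(1)).applymap(lambda x: 1 if x else 0)
assert (fast == slow).all().all()   # NaN rows end up as 0 in both versions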
@Paul H's answer is good, performant, and what I'd generally recommend.
That said, if you want to squeeze every last bit of performance, this is a decent candidate for numba which you can use to compute the answer in a single pass over the data.
import numpy as np
from numba import njit

@njit
def do_calc(arr):
    N = arr.shape[0]
    ans = np.empty(N, dtype=np.int_)
    ans[0] = 0  # first row has no predecessor, so it compares as 0 (like the NaN case)
    for i in range(1, N):
        ans[i] = 1 if arr[i] >= arr[i-1] else 0
    return ans
a = (df >= df.shift(1)).astype(int)
b = pd.DataFrame(pd.Series(do_calc(df[0].values), df[0].index))
from pandas.testing import assert_frame_equal
assert_frame_equal(a, b)
Here are timings
In [45]: %timeit b = pd.DataFrame(pd.Series(do_calc(df[0].values), df[0].index))
135 µs ± 1.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [46]: %timeit a = (df >= df.shift(1)).astype(int)
762 µs ± 22.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That's my current best solution:
values = df.values[1:] >= df.values[:-1]
data = np.array(values, dtype=int)
s = pd.DataFrame(data, df.index[1:])
I'm getting 10000 loops, best of 3: 125 µs per loop. x10 improvement. But I think it could be done even faster.
PS: this solution isn't exactly correct, since the row for the first index (which should be 0 / NaN) is missing.
PPS: that can be corrected by pd.DataFrame(np.append([[0]],data), df.index)
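Putting the PS/PPS correction together, the full snippet would look roughly like this (a sketch, reusing df from above; the prepended 0 stands in for the first row, which has no predecessor):

values = (df.values[1:] >= df.values[:-1]).astype(int)   # shape (29, 1)
s = pd.DataFrame(np.append([[0]], values, axis=0), index=df.index)

# sanity check against the one-liner answer
assert (s.values == (df >= df.shift(1)).astype(int).values).all()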
Related
Say we have a call like:
ser = pd.Series([1,2,3,4])
ser[ser>1].any()
Now my question is: is pandas "smart enough" to stop the computation and return True as soon as it encounters the 2, or does it really go through the whole array first and only check any() afterwards? If the latter is true: how can I avoid this behavior?
Is pandas "smart enough" to stop computation
pandas works differently: it relies heavily on vectorized operations (one operation/function applied to a whole sequence of values at once), and the expression ser[ser>1].any() implies:
ser > 1 - evaluated on the whole series, returns a boolean array of per-value results: array([False, True, True, True])
ser[ser > 1] - filters the series by that boolean array
.any() - finally evaluates the function on the filtered series
Actually, your intention is covered by (ser > 1).any() (without interim filtering).
If you expect classical any behavior (returning immediately on encountering a True), you can fall back to the classical way:
any(x > 1 for x in ser)
And, of course, the classical way in this case will go faster:
In [409]: %timeit (ser > 1).any()
75.4 µs ± 636 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [410]: %timeit any(x > 1 for x in ser)
1.44 µs ± 22.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
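If the series is large and the matching element tends to appear early, you can get genuine short-circuiting without the per-element Python overhead by looping over the underlying numpy array in numba (a sketch, not part of the original answer; any_gt is a made-up helper name):

import numba
import pandas as pd

@numba.njit
def any_gt(values, threshold):
    # stop at the first element that exceeds the threshold
    for v in values:
        if v > threshold:
            return True
    return False

ser = pd.Series([1, 2, 3, 4])
print(any_gt(ser.to_numpy(), 1))   # True, after looking at only two elements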
I have the following nested loop:
sum_tot = 0.0
for i in range(len(N)-1):
    for j in range(len(N)-1):
        sum_tot = sum_tot + N[i]**2*N[j]**2*W[i]*W[j]*x_i[j][-1] / (N[j]**2 - x0**2) * (z_i[i][j] - z_j[i][j]) * x_j[i][-1] / (N[i]**2 - x0**2)
It's basically a mathematical function that has a double summation. Each sum goes up to the length of N. I've been trying to figure out if there was a way to write this without using a nested for-loop in order to reduce computational time. I tried using list comprehension, but the computational time is similar if not the same. Is there a way to write this expression as matrices to avoid the loops?
Note that your loop actually stops at index len(N)-2: range goes up to but not including its argument. You probably mean to write for i in range(len(N)).
It's also difficult to reduce the summation itself: the time it takes is determined by the number of terms computed, so if you write it a different way that still involves the same number of terms, it will take just as long. However, O(n^2) isn't exactly bad: it looks like the best you can do in this situation unless you find a mathematical simplification of the problem.
You might consider checking this post to gather ways to write out the summation in a neater fashion.
@Kraigolas makes valid points. But let's try a few benchmarks on a dummy double-nested operation either way. (Hint: Numba might help you speed things up.)
Note: I would avoid materializing the full cross-product of the two ranges as numpy arrays, because all of those index pairs would be in memory at once. If the range is massive, you may run out of memory.
Nested for loops
n = 5000
s1 = 0
for i in range(n):
    for j in range(n):
        s1 += (i/2) + (j/3)
print(s1)
#2.26 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
List comprehension
n = 5000
s2 = 0
s2 = sum([i/2+j/3 for i in range(n) for j in range(n)])
print(s2)
#3.2 s ± 307 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Itertools product
from itertools import product
n = 5000
s3 = 0
for i,j in product(range(n), repeat=2):
    s3 += (i/2) + (j/3)
print(s3)
#2.35 s ± 186 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note: when using Numba, you should run the function at least once before timing it, because the first call compiles the code and is therefore slow. The real speedup shows up from the second call onwards.
Numba njit (SIMD)
from numba import njit

n = 5000

@njit
def f(n):
    s = 0
    for i in range(n):
        for j in range(n):
            s += (i/2) + (j/3)
    return s

s4 = f(n)
#29.4 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numba njit parallel with prange
An excellent suggestion by @Tim, added to the benchmarks:
from numba import njit, prange

@njit(parallel=True)
def f(n):
    s = 0
    for i in prange(n):
        for j in prange(n):
            s += (i/2) + (j/3)
    return s

s5 = f(n)
#21.8 ms ± 4.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A significant boost with Numba, as expected. Maybe try that?
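Applied to the expression from the question, the same idea might look roughly like this (a sketch: it assumes N, W, x_i, x_j, z_i, z_j have already been converted to numpy arrays of compatible shapes and x0 is a scalar):

from numba import njit

@njit
def double_sum(N, W, x_i, x_j, z_i, z_j, x0):
    # direct translation of the question's nested loop
    sum_tot = 0.0
    for i in range(len(N) - 1):
        for j in range(len(N) - 1):
            sum_tot += (N[i]**2 * N[j]**2 * W[i] * W[j]
                        * x_i[j, -1] / (N[j]**2 - x0**2)
                        * (z_i[i, j] - z_j[i, j])
                        * x_j[i, -1] / (N[i]**2 - x0**2))
    return sum_tot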
To convert this to matrix calculations, I would suggest combining some terms first.
If these objects are not numpy arrays, it's better to convert them to numpy arrays, as they support element-wise operations.
To convert, simply do
import numpy
N = numpy.array(N)
W = numpy.array(W)
x_i = numpy.array(x_i)
x_j = numpy.array(x_j)
z_i = numpy.array(z_i)
z_j = numpy.array(z_j)
Then,
common_terms = N**2*W/(N**2-x0**2)
i_terms = common_terms*x_j[:,-1]
j_terms = common_terms*x_i[:,-1]
i_j_matrix = z_i - z_j
sum_output = (i_terms.reshape((1,-1)) @ i_j_matrix @ j_terms.reshape((-1,1)))[0,0]
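A quick consistency check on random data (a sketch; note that the question's loop runs over range(len(N)-1), so it compares against a loop over the full range here, or you can trim accordingly):

import numpy as np

n = 50
N = np.random.rand(n) + 2.0          # keep N**2 away from x0**2
W = np.random.rand(n)
x_i = np.random.rand(n, 4)
x_j = np.random.rand(n, 4)
z_i = np.random.rand(n, n)
z_j = np.random.rand(n, n)
x0 = 0.5

loop_sum = 0.0
for i in range(n):
    for j in range(n):
        loop_sum += (N[i]**2 * N[j]**2 * W[i] * W[j] * x_i[j, -1] / (N[j]**2 - x0**2)
                     * (z_i[i, j] - z_j[i, j]) * x_j[i, -1] / (N[i]**2 - x0**2))

common_terms = N**2 * W / (N**2 - x0**2)
i_terms = common_terms * x_j[:, -1]
j_terms = common_terms * x_i[:, -1]
sum_output = (i_terms.reshape((1, -1)) @ (z_i - z_j) @ j_terms.reshape((-1, 1)))[0, 0]

assert np.isclose(loop_sum, sum_output)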
I have this code:
s = set([5,6,7,8])
if key in s:
    return True
if key not in s:
    return False
It seems to me that it shouldn't, in theory, differ time wise, but I may be missing something under the hood.
Is there any reason to prefer one over the other in terms of processing time or readability?
Perhaps this is an example of:
"Premature optimization is the root of all evil"?
Short Answer: No, no difference. Yes, probably premature optimization.
OK, I ran this test:
import random
s = set([5,6,7,8])
for _ in range(5000000):
    s.add(random.randint(-100000,100000000))
def test_in():
    count = 0
    for _ in range(50000):
        if random.randint(-100000,100000000) in s:
            count += 1
    print(count)
def test_not_in():
    count = 0
    for _ in range(50000):
        if random.randint(-100000,100000000) not in s:
            count += 1
    print(count)
When I time the outputs:
%timeit test_in()
10 loops, best of 3: 83.4 ms per loop
%timeit test_not_in()
10 loops, best of 3: 78.7 ms per loop
BUT, that small difference seems to be a symptom of the counting itself: on average there are about 47,500 "not in" hits but only about 2,500 "in" hits. If I change both tests to pass, e.g.:
def test_in():
    for _ in range(50000):
        if random.randint(-100000,100000000) in s:
            pass
The results are nearly identical
%timeit test_in()
10 loops, best of 3: 77.4 ms per loop
%timeit test_not_in()
10 loops, best of 3: 78.7 ms per loop
In this case, my intuition failed me. I had thought that checking that something is not in the set could add some additional processing time. When I further consider what a hashmap does, it seems obvious that this can't be the case.
You shouldn't see a difference. The lookup time in a set is constant: you hash the entry, then look it up in a hashmap. Every key takes the same time to check, and in vs not in should be comparable.
Running a simple performance test in an ipython session with timeit confirms g.d.d.c's statement.
def one(k, s):
    if k in s:
        return True

def two(k, s):
    if k not in s:
        return False
s = set(range(1, 100))
%timeit -r7 -n 10000000 one(50, s)
## 83.7 ns ± 0.874 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit -r7 -n 10000000 two(50, s)
## 86.1 ns ± 1.11 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Optimisations such as this aren't going to gain you a lot, and as has been pointed out in the comments will in fact reduce the speed at which you'll push out bugfixes/improvements/... due to bad readability. For this type of low-level performance gains, I'd suggest looking into Cython or Numba.
I would like to set an entire field of a NumPy structured scalar from within a Numba compiled nopython function. The desired_fn in the code below is a simple example of what I would like to do, and working_fn is an example of how I can currently accomplish this task.
import numpy as np
import numba as nb
test_numpy_dtype = np.dtype([("blah", np.int64)])
test_numba_dtype = nb.from_dtype(test_numpy_dtype)
@nb.njit
def working_fn(thing):
    for j in range(len(thing)):
        thing[j]['blah'] += j

@nb.njit
def desired_fn(thing):
    thing['blah'] += np.arange(len(thing))
a = np.zeros(3,test_numpy_dtype)
print(a)
working_fn(a)
print(a)
desired_fn(a)
The error generated from running desired_fn(a) is:
numba.errors.InternalError: unsupported array index type const('blah') in [const('blah')]
[1] During: typing of staticsetitem at /home/sam/PycharmProjects/ChessAI/playground.py (938)
This is needed for extremely performance-critical code that will be run billions of times, so eliminating the need for these types of loops seems crucial.
The following works (numba 0.37):
@nb.njit
def desired_fn(thing):
    thing.blah[:] += np.arange(len(thing))
    # or
    # thing['blah'][:] += np.arange(len(thing))
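For reference, a quick sanity check (a sketch) that the fixed desired_fn produces the same result as working_fn on the arrays from the question:

a = np.zeros(3, test_numpy_dtype)
b = np.zeros(3, test_numpy_dtype)
working_fn(a)
desired_fn(b)
assert np.array_equal(a['blah'], b['blah'])   # both are [0, 1, 2]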
If you are operating primarily on columns of your data instead of rows, you might consider using a different data container. A numpy structured array is laid out like a vector of structs rather than a struct of arrays. This means that when you want to update blah, you are moving through non-contiguous memory space as you traverse the array.
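As a rough illustration of that layout point, one struct-of-arrays alternative (a sketch, not from the original answer) is to keep each field as its own contiguous 1D array, for example in a plain dict of arrays:

import numpy as np
import numba as nb

@nb.njit
def add_range(blah):
    # column-wise update over contiguous memory
    blah += np.arange(len(blah))

columns = {"blah": np.zeros(3, dtype=np.int64)}
add_range(columns["blah"])
print(columns["blah"])   # [0 1 2]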
Also, with any code optimization, it's always worth using timeit or some other timing harness (one that excludes the time required to jit the code) to see what the actual performance is. You might find with numba that explicit looping, while more verbose, can actually be faster than vectorized code.
Without numba, accessing field values is no slower than accessing columns of a 2d array:
In [1]: arr2 = np.zeros((10000), dtype='i,i')
In [2]: arr2.dtype
Out[2]: dtype([('f0', '<i4'), ('f1', '<i4')])
Modifying a field:
In [4]: %%timeit x = arr2.copy()
...: x['f0'] += 1
...:
16.2 µs ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Similar time if I assign the field to a new variable:
In [5]: %%timeit x = arr2.copy()['f0']
...: x += 1
...:
15.2 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Much faster if I construct a 1d array of the same size:
In [6]: %%timeit x = np.zeros(arr2.shape, int)
...: x += 1
...:
8.01 µs ± 15.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
But similar time when accessing the column of a 2d array:
In [7]: %%timeit x = np.zeros((arr2.shape[0],2), int)
...: x[:,0] += 1
...:
17.3 µs ± 23.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I'm looking for the fastest way to check for the occurrence of NaN (np.nan) in a NumPy array X. np.isnan(X) is out of the question, since it builds a boolean array of shape X.shape, which is potentially gigantic.
I tried np.nan in X, but that seems not to work because np.nan != np.nan. Is there a fast and memory-efficient way to do this at all?
(To those who would ask "how gigantic": I can't tell. This is input validation for library code.)
Ray's solution is good. However, on my machine it is about 2.5x faster to use numpy.sum in place of numpy.min:
In [13]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 244 us per loop
In [14]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 97.3 us per loop
Unlike min, sum doesn't require branching, which on modern hardware tends to be pretty expensive. This is probably the reason why sum is faster.
Edit: the above test was performed with a single NaN right in the middle of the array.
It is interesting to note that min is slower in the presence of NaNs than in their absence. It also seems to get slower as NaNs get closer to the start of the array. On the other hand, sum's throughput seems constant regardless of whether there are NaNs and where they're located:
In [40]: x = np.random.rand(100000)
In [41]: %timeit np.isnan(np.min(x))
10000 loops, best of 3: 153 us per loop
In [42]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
In [43]: x[50000] = np.nan
In [44]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 239 us per loop
In [45]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.8 us per loop
In [46]: x[0] = np.nan
In [47]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 326 us per loop
In [48]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
I think np.isnan(np.min(X)) should do what you want.
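A minimal illustration (a sketch): min propagates NaN, so a single scalar check on the result is enough.

import numpy as np

X = np.random.rand(1000)
print(np.isnan(np.min(X)))   # False
X[123] = np.nan
print(np.isnan(np.min(X)))   # True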
There are two general approaches here:
Check each array item for nan and take any.
Apply some cumulative operation that preserves nans (like sum) and check its result.
While the first approach is certainly the cleanest, the heavy optimization of some of the cumulative operations (particularly those executed in BLAS, like dot) can make them quite fast. Note that dot, like some other BLAS operations, is multithreaded under certain conditions. This explains the difference in speed between different machines.
import numpy as np
import perfplot
def min(a):
    return np.isnan(np.min(a))

def sum(a):
    return np.isnan(np.sum(a))

def dot(a):
    return np.isnan(np.dot(a, a))

def any(a):
    return np.any(np.isnan(a))

def einsum(a):
    return np.isnan(np.einsum("i->", a))
b = perfplot.bench(
    setup=np.random.rand,
    kernels=[min, sum, dot, any, einsum],
    n_range=[2 ** k for k in range(25)],
    xlabel="len(a)",
)
b.save("out.png")
b.show()
Even though there is already an accepted answer, I'd like to demonstrate the following (with Python 2.7.2 and Numpy 1.6.0 on Vista):
In []: x= rand(1e5)
In []: %timeit isnan(x.min())
10000 loops, best of 3: 200 us per loop
In []: %timeit isnan(x.sum())
10000 loops, best of 3: 169 us per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 134 us per loop
In []: x[5e4]= NaN
In []: %timeit isnan(x.min())
100 loops, best of 3: 4.47 ms per loop
In []: %timeit isnan(x.sum())
100 loops, best of 3: 6.44 ms per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 138 us per loop
Thus, the truly efficient approach may depend heavily on the operating system. In any case, the dot(.)-based one seems to be the most stable.
If you're comfortable with numba, it lets you create a fast short-circuiting function (one that stops as soon as a NaN is found):
import numba as nb
import math
@nb.njit
def anynan(array):
    array = array.ravel()
    for i in range(array.size):
        if math.isnan(array[i]):
            return True
    return False
If there is no NaN, the function might actually be slower than np.min; I suspect that's because np.min runs as heavily optimized vectorized code for large arrays:
import numpy as np
array = np.random.random(2000000)
%timeit anynan(array) # 100 loops, best of 3: 2.21 ms per loop
%timeit np.isnan(array.sum()) # 100 loops, best of 3: 4.45 ms per loop
%timeit np.isnan(array.min()) # 1000 loops, best of 3: 1.64 ms per loop
But if there is a NaN in the array, especially if its position is at a low index, then it's much faster:
array = np.random.random(2000000)
array[100] = np.nan
%timeit anynan(array) # 1000000 loops, best of 3: 1.93 µs per loop
%timeit np.isnan(array.sum()) # 100 loops, best of 3: 4.57 ms per loop
%timeit np.isnan(array.min()) # 1000 loops, best of 3: 1.65 ms per loop
Similar results can be achieved with Cython or a C extension; those are a bit more complicated (or easily available as bottleneck.anynan), but ultimately do the same as my anynan function.
use .any()
if numpy.isnan(myarray).any()
numpy.isfinite may be better than isnan for checking:
if not np.isfinite(prop).all()
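One caveat (a small sketch, as an aside): isfinite also flags +/-inf, not only NaN, so pick whichever matches your definition of invalid input.

import numpy as np

prop = np.array([1.0, np.inf, 3.0])
print(np.isnan(prop).any())          # False -- no NaN present
print(not np.isfinite(prop).all())   # True  -- inf is not finite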
Related to this is the question of how to find the first occurrence of NaN. This is the fastest way to handle that that I know of:
index = next((i for (i,n) in enumerate(iterable) if n!=n), None)
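A vectorized alternative for the same task (a sketch; first_nan_index is a made-up helper): argmax on the boolean mask gives the index of the first True, with an any() guard for the no-NaN case. Unlike the generator above, this does build the full boolean mask.

import numpy as np

def first_nan_index(x):
    mask = np.isnan(x)
    if not mask.any():            # no NaN at all
        return None
    return int(np.argmax(mask))   # index of the first True

x = np.array([1.0, 2.0, np.nan, 4.0])
print(first_nan_index(x))   # 2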
Adding to @nico-schlömer's and @mseifert's answers, I measured the performance of a numba has_nan test with early stopping, compared against some of the functions that traverse the full array.
On my machine, for an array without NaNs, the break-even happens at ~10^4 elements.
import perfplot
import numpy as np
import numba
import math
def min(a):
    return np.isnan(np.min(a))

def dot(a):
    return np.isnan(np.dot(a, a))

def einsum(a):
    return np.isnan(np.einsum("i->", a))

@numba.njit
def has_nan(a):
    # note: range(a.size) rather than range(a.size - 1), so the last element is checked too
    for i in range(a.size):
        if math.isnan(a[i]):
            return True
    return False
def array_with_missing_values(n, p):
    """Return an array of size n with p % of its entries set to NaN.

    Ex: n=1e6, p=1 : 1e4 NaNs assigned at random positions."""
    a = np.random.rand(n)
    p = np.random.randint(0, len(a), int(p*len(a)/100))
    a[p] = np.nan
    return a
#%%
perfplot.show(
    setup=lambda n: array_with_missing_values(n, 0),
    kernels=[min, dot, has_nan],
    n_range=[2 ** k for k in range(20)],
    logx=True,
    logy=True,
    xlabel="len(a)",
)
What happens if the array has NaNs? I investigated the impact of the NaN coverage of the array.
For arrays of length 1,000,000, has_nan becomes the better option if there are ~10^-3 % NaNs (so ~10 NaNs) in the array.
#%%
N = 1000000 # 100000
perfplot.show(
    setup=lambda p: array_with_missing_values(N, p),
    kernels=[min, dot, has_nan],
    n_range=np.array([2 ** k for k in range(20)]) / 2**20 * 0.01,
    logy=True,
    xlabel=f"% of nan in array (N = {N})",
)
If most arrays in your application have NaNs and you're looking for the ones without, then has_nan is the best approach.
Otherwise, dot seems to be the best option.