Python - In-line boolean evaluation without IF statements - python

I am trying to assess the value of one column of a dataframe to determine the value of another column. I did this successfully with an if statement and the .apply() function, i.e.

if col_x < 0.3:
    return y
elif col_x > 0.6:
    return z

etc. The problem is that this takes quite a while to run on a lot of data. Instead, I am trying to use the following logic to determine the new column value:

(x < 0.3) * y + (x > 0.6) * z
So Python evaluates TRUE/FALSE and applies the corresponding value. This seems to run much faster; the only thing is that pandas warns:

UserWarning: evaluating in Python space because the '*' operator is not supported by numexpr for the bool dtype, use '&' instead
unsupported[op_str]))

Is this a problem? Should I be using "&"? I feel that "&" would be incorrect when multiplying.
Thank you!

From what I have read so far, the performance gap is caused by the parser backend chosen by pandas. There is the regular Python parser as a backend and, additionally, a pandas parsing backend.
The docs say that there is no performance gain from using plain old Python over pandas here: Pandas eval Backends
However, you obviously hit a blind spot in the pandas backend; i.e. you formed an expression that cannot be evaluated using pandas. The result is that pandas falls back to the original Python parsing backend, as stated in the resulting UserWarning:
UserWarning: evaluating in Python space because the '*' operator is not supported by numexpr for the bool dtype, use '&' instead
unsupported[op_str]))
(More on this topic)
Timing evaluations
So, as we now know about different parsing backends, it's time to check a few options provided by pandas that are suitable for your desired dataframe operation (complete script below):
expr_a = '''(a < 0.3) * 1 + (a > 0.6) * 3 + (a >= 0.3) * (a <= 0.6) * 2'''
(1) Evaluate the expression as a string using the pandas backend
(2) Evaluate the same string using the python backend
(3) Evaluate the expression string with external variable reference (@) using pandas
(4) Solve the problem using df.apply()
(5) Solve the problem using df.applymap()
(6) Direct submission of the expression (no string evaluation)
The results on my machine for a dataframe with 10,000,000 random float values in one column are:
(1) Eval (pd) 0.240498406269
(2) Eval (py) 0.197919774926
(3) Eval @ (pd) 0.200814546686
(4) Apply 3.242620778595
(5) ApplyMap 6.542354086152
(6) Direct 0.140075372736
The major points explaining the performance differences are most likely the following:
Using a python function (as in apply() and applymap()) is (of course!) much slower than using functionality completely implemented in C
String evaluation is expensive (see (6) vs (2))
The overhead (1) has over (2) is probably due to the backend choice and the fallback to the python backend, because pandas does not evaluate bool * int.
Nothing new, eh?
How to proceed
We basically just proved what our gut feeling was telling us before (namely: pandas chooses the right backend for a given task).
As a consequence, I think it is totally okay to ignore the UserWarning, as long as you know the underlying hows and whys.
Thus: keep going and have pandas use the fastest of all implementations, which, as usual, are the C functions.
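As a side note: if you ever want the same three-way mapping without triggering the bool-times-int warning at all, numpy.select expresses it directly and also stays in C. This is not from the original answer, just an alternative sketch using the thresholds and values from the example expression:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0.1, 0.5, 0.9]})

# conditions are checked in order; 'default' covers the middle band
conditions = [df['a'] < 0.3, df['a'] > 0.6]
choices = [1, 3]
df['b'] = np.select(conditions, choices, default=2)

print(df['b'].tolist())  # → [1, 2, 3]
```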
The Test Script
from __future__ import print_function
import sys
import random
import pandas as pd
import numpy as np
from timeit import default_timer as timer
def conditional_column(val):
    if val < 0.3:
        return 1
    elif val > 0.6:
        return 3
    return 2


if __name__ == '__main__':
    nr = 10000000
    df = pd.DataFrame({
        'a': [random.random() for _ in range(nr)]
    })
    print(nr, 'rows')

    expr_a = '''(a < 0.3) * 1 + (a > 0.6) * 3 + (a >= 0.3) * (a <= 0.6) * 2'''
    expr_b = '''(@df.a < 0.3) * 1 + (@df.a > 0.6) * 3 + (@df.a >= 0.3) * (@df.a <= 0.6) * 2'''
    fmt = '{:16s} {:.12f}'

    # Evaluate the string expression using the pandas parser
    t0 = timer()
    b = df.eval(expr_a, parser='pandas')
    print(fmt.format('(1) Eval (pd)', timer() - t0))

    # Evaluate the string expression using the python parser
    t0 = timer()
    c = df.eval(expr_a, parser='python')
    print(fmt.format('(2) Eval (py)', timer() - t0))

    # Evaluate the string expression using the pandas parser with external variable access (@)
    t0 = timer()
    d = df.eval(expr_b, parser='pandas')
    print(fmt.format('(3) Eval @ (pd)', timer() - t0))

    # Use apply to map the if/else function to each row of the df
    t0 = timer()
    e = df['a'].apply(conditional_column)
    print(fmt.format('(4) Apply', timer() - t0))

    # Use element-wise apply (WARNING: requires a dataframe and walks ALL cols AND rows)
    t0 = timer()
    f = df.applymap(conditional_column)
    print(fmt.format('(5) ApplyMap', timer() - t0))

    # Directly combine the boolean series returned by comparisons on the column
    t0 = timer()
    g = (df['a'] < 0.3) * 1 + (df['a'] > 0.6) * 3 + (df['a'] >= 0.3) * (df['a'] <= 0.6) * 2
    print(fmt.format('(6) Direct', timer() - t0))

Related

Using sympy cxxcode on Heaviside function fails

I am using sympy version 1.10.1 and numpy version 1.20.0. Can someone please explain why the following simple code results in an error?
import sympy as sp
from sympy.printing import cxxcode
T = sp.symbols('T', real=True, finite=True)
func = sp.Heaviside(T - 0.01)
func_cxx = cxxcode(func)
The error is
ValueError: All Piecewise expressions must contain an (expr, True) statement to be used as a default condition. Without one, the generated expression may not evaluate to anything under some condition.
I understand that sympy converts Heaviside to a Piecewise function, but I'd imagine the corresponding Piecewise is also defined for all real & finite T:
>>> func.rewrite(sp.Piecewise)
Piecewise((0, T - 0.01 < 0), (1/2, Eq(T - 0.01, 0)), (1, T - 0.01 > 0))
If I were you I would open an issue on SymPy.
To solve your problem you can modify the function in order to get cxxcode to work (continuing from the code in the question):

func = func.rewrite(sp.Piecewise)
# replace the last (expr, cond) pair with a default (expr, True) condition
args = list(func.args)
args[-1] = (args[-1][0], True)
# recreate the function
func = sp.Piecewise(*args)
print(cxxcode(func))
# out: '((T - 0.01 < 0) ? (\n 0\n)\n: ((T - 0.01 == 0) ? (\n 1.0/2.0\n)\n: (\n 1\n)))'

Python 3.X: Why numexpr.evaluate() slower than eval()?

The purpose of using numexpr.evaluate() is to speed up the computation. But in my case it is even slower than numpy and eval(). I would like to know why.
Example code:
import datetime
import numpy as np
import numexpr as ne
expr = '11808000.0*1j*x**2*exp(2.5e-10*1j*x) + 1512000.0*1j*x**2*exp(5.0e-10*1j*x)'
# use eval
start_eval = datetime.datetime.now()
namespace = dict(x=np.array([m+3j for m in range(1, 1001)]), exp=np.exp)
result_eval = eval(expr, namespace)
end_eval = datetime.datetime.now()
# print(result)
print("time by using eval : %s" % (end_eval- start_eval))
# use numexpr
# ne.set_num_threads(8)
start_ne = datetime.datetime.now()
x = np.array([n+3j for n in range(1, 1001)])
result_ne = ne.evaluate(expr)
end_ne = datetime.datetime.now()
# print(result_ne)
print("time by using numexpr: %s" % (end_ne- start_ne))
The output is:
time by using eval : 0:00:00.002998
time by using numexpr: 0:00:00.052969
Thank you all
I got the answer from robbmcleod on GitHub:
for NumExpr 2.6 you'll need arrays around 128 - 256 kElements to see a speed-up. NumPy always starts faster as it doesn't have to synchronize at a thread barrier and otherwise spin-up the virtual machine
Also, once you call numexpr.evaluate() the second time, it should be faster as it will have already compiled everything. Compilation takes around 0.5 ms for simple expressions, more for longer expressions. The expression is stored as a hash, so the next time that computational expense is gone.
related url: https://github.com/pydata/numexpr/issues/301#issuecomment-388097698
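To make the array-size point concrete, here is a small sketch (the expression and the one-million-element size are arbitrary stand-ins, not the question's original formula) that simply checks numexpr and NumPy agree; timing the two calls with timeit should show numexpr pulling ahead only at sizes like this, and only after the first, compiling call:

```python
import numpy as np
import numexpr as ne

# 1,000,000 elements: well above the ~128-256 kElement threshold mentioned above
x = np.arange(1, 1_000_001, dtype=np.float64)

expected = 2.0 * x**2 + 3.0 * x               # plain NumPy
result = ne.evaluate('2.0 * x**2 + 3.0 * x')  # first call compiles; repeat calls hit the expression cache

assert np.allclose(result, expected)
```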

What is the fastest way to sort strings in Python if locale is a non-concern?

I was trying to find a fast way to sort strings in Python and the locale is a non-concern i.e. I just want to sort the array lexically according to the underlying bytes. This is perfect for something like radix sort. Here is my MWE
import numpy as np
import timeit
# randChar is workaround for MemoryError in mtrand.RandomState.choice
# http://stackoverflow.com/questions/25627161/how-to-solve-memory-error-in-mtrand-randomstate-choice
def randChar(f, numGrp, N):
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

N = int(1e7)
K = 100
id3 = randChar("id%010d", N//K, N)  # small groups (char)
timeit.Timer("id3.sort()", "from __main__ import id3").timeit(1)  # 6.8 seconds
As you can see it took 6.8 seconds which is almost 10x slower than R's radix sort below.
N = 1e7
K = 100
id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE)
system.time(sort(id3,method="radix"))
I understand that Python's .sort() doesn't use radix sort; is there an implementation somewhere that allows me to sort strings as performantly as R?
AFAIK both R and Python "intern" strings so any optimisations in R can also be done in Python.
The top google result for "radix sort strings python" is this gist which produced an error when sorting on my test array.
It is true that R interns all strings, meaning it has a "global character cache" which serves as a central dictionary of all strings ever used by your program. This has its advantages: the data takes less memory, and certain algorithms (such as radix sort) can take advantage of this structure to achieve higher speed. This is particularly true for the scenarios such as in your example, where the number of unique strings is small relative to the size of the vector. On the other hand it has its drawbacks too: the global character cache prevents multi-threaded write access to character data.
In Python, afaik, only string literals are interned. For example:
>>> 'abc' is 'abc'
True
>>> x = 'ab'
>>> (x + 'c') is 'abc'
False
In practice it means that, unless you've embedded data directly into the text of the program, nothing will be interned.
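You can, however, intern strings explicitly with sys.intern, which is sometimes used to speed up comparisons in exactly this kind of workload. A minimal sketch (CPython behavior):

```python
import sys

x = 'ab'
y = x + 'c'          # built at runtime, so NOT interned automatically
a = sys.intern(y)    # force the computed string into the intern table
b = sys.intern('abc')

# interned strings with equal contents are the same object,
# so equality checks can short-circuit on identity
print(a is b)  # → True
```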
Now, for your original question: "what is the fastest way to sort strings in python"? You can achieve very good speeds, comparable with R, with python datatable package. Here's the benchmark that sorts N = 10⁸ strings, randomly selected from a set of 1024:
import datatable as dt
import pandas as pd
import random
from time import time
n = 10**8
src = ["%x" % random.getrandbits(10) for _ in range(n)]
f0 = dt.Frame(src)
p0 = pd.DataFrame(src)
f0.to_csv("test1e8.csv")
t0 = time(); f1 = f0.sort(0); print("datatable: %.3fs" % (time()-t0))
t0 = time(); src.sort(); print("list.sort: %.3fs" % (time()-t0))
t0 = time(); p1 = p0.sort_values(0); print("pandas: %.3fs" % (time()-t0))
Which produces:
datatable: 1.465s / 1.462s / 1.460s (multiple runs)
list.sort: 44.352s
pandas: 395.083s
The same dataset in R (v3.4.2):
> require(data.table)
> DT = fread("test1e8.csv")
> system.time(sort(DT$C1, method="radix"))
user system elapsed
6.238 0.585 6.832
> system.time(DT[order(C1)])
user system elapsed
4.275 0.457 4.738
> system.time(setkey(DT, C1)) # sort in-place
user system elapsed
3.020 0.577 3.600
Jeremy Mets posted in the comments of this blog post that Numpy can sort strings fairly fast by converting the list to an np.array. This indeed improves performance, however it is still slower than Julia's implementation.
import numpy as np
import timeit
# randChar is workaround for MemoryError in mtrand.RandomState.choice
# http://stackoverflow.com/questions/25627161/how-to-solve-memory-error-in-mtrand-randomstate-choice
def randChar(f, numGrp, N):
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

N = int(1e7)
K = 100
id3 = np.array(randChar("id%010d", N//K, N))  # small groups (char)
timeit.Timer("id3.sort()", "from __main__ import id3").timeit(1)
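The likely reason the conversion helps: np.array of Python strings produces a fixed-width unicode array, so np.sort compares flat fixed-size buffers instead of dereferencing per-element Python objects. A small illustrative sketch:

```python
import numpy as np

# 20 strings drawn from 7 distinct values, mirroring the "small groups" setup above
arr = np.array(["id%010d" % (i % 7) for i in range(20)])
print(arr.dtype)    # a fixed-width unicode dtype, e.g. <U12

out = np.sort(arr)  # lexicographic sort on the fixed-width buffers
```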

Why single addition takes longer than single additions plus single assignment?

Here is the Python code; I use Python 3.5.2 on an Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz:
import time
empty_loop_t = 0.14300823211669922
N = 10000000
def single_addition(n):
    a = 1.0
    b = 0.0
    start_t = time.time()
    for i in range(0, n):
        a + b
    end_t = time.time()
    cost_t = end_t - start_t - empty_loop_t
    print(n, "iterations single additions:", cost_t)
    return cost_t

single_addition(N)

def single_addition_plus_single_assignment(n):
    a = 1.0
    b = 0.0
    c = 0.0
    start_t = time.time()
    for i in range(0, n):
        c = a + b
    end_t = time.time()
    cost_t = end_t - start_t - empty_loop_t
    print(n, "iterations single additions and single assignments:", cost_t)
    return cost_t

single_addition_plus_single_assignment(N)
The output is:
10000000 iterations single additions: 0.19701123237609863
10000000 iterations single additions and single assignments: 0.1890106201171875
Normally, to get a more reliable result, it is better to repeat the test several times (K-fold). However, since the K-fold loop itself influences the result, I don't use it in my test, and I'm sure this inequality can be reproduced, at least on my machine. So the question is: why does this happen?
I run it with pypy (had to set empty_loop_t = 0) and got the following results:
(10000000, 'iterations single additions:', 0.014394044876098633)
(10000000, 'iterations single additions and single assignments:', 0.018398046493530273)
So I guess it comes down to what the interpreter does with the source code and how it executes it. It might be that a deliberate assignment takes fewer operations and less work in a non-JIT interpreter than disposing of the result, while the JIT compiler forces the code to perform the actual number of operations.
Furthermore, using a JIT interpreter makes your script run ~50 times faster on my configuration. If your general aim is to optimize the running time of your script, that is probably the direction to look.
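One way to make the CPython side of this concrete is to compare the bytecode of the two loop bodies with the dis module (exact opcode names vary slightly between CPython versions): the bare expression is followed by a POP_TOP that throws the result away, while the assignment ends in a STORE_FAST.

```python
import dis

def discard():
    a, b = 1.0, 0.0
    a + b        # result is computed and immediately discarded

def assign():
    a, b = 1.0, 0.0
    c = a + b    # result is stored in a local variable

dis.dis(discard)  # the addition is followed by POP_TOP
dis.dis(assign)   # the addition is followed by STORE_FAST
```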

How to use sympy codegen with expressions that contain implemented functions

I'm trying to compile an expression that contains an UndefinedFunction which has an implementation provided. (Alternatively: an expression which contains a Symbol which represents a call to an external numerical function.)
Is there a way to do this? Either with autowrap or codegen or perhaps with some manual editing of the generated files?
The following naive example does not work:
import sympy as sp
import numpy as np
from sympy.abc import *
from sympy.utilities.lambdify import implemented_function
from sympy.utilities.autowrap import autowrap, ufuncify
def g_implementation(a):
    """represents some numerical function"""
    return a*5
# sympy wrapper around the above function
g = implemented_function('g', g_implementation)
# some random expression using the above function
e = (x+g(a))**2+100
# try to compile said expression
f = autowrap(e, backend='cython')
# fails with "undefined reference to `g'"
EDIT:
I have several large Sympy expressions
The expressions are generated automatically (via differentiation and such)
The expressions contain some "implemented UndefinedFunctions" that call some numerical functions (i.e. NOT Sympy expressions)
The final script/program that has to evaluate the expressions for some input will be called quite often. That means that evaluating the expression in Sympy (via evalf) is definitely not feasible. Even compiling just in time (lambdify, autowrap, ufuncify, numba.jit) produces too much of an overhead.
Basically I want to create a binary python extension for those expressions without having to implement them by hand in C, which I consider too error prone.
OS is Windows 7 64bit
You may want to take a look at this answer about serialization of SymPy lambdas (generated by lambdify).
It's not exactly what you asked but may alleviate your problem with startup performance. Then lambdify-ied functions will mostly count only execution time.
You may also take a look at Theano. It has nice integration with SymPy.
Ok, this could do the trick, I hope, if not let me know and I'll try again.
I compare a compiled version of an expression using Cython against the lambdified expression.
from sympy.utilities.autowrap import autowrap
from sympy import symbols, lambdify
def wraping(expression):
    return autowrap(expression, backend='cython')

def lamFunc(expression, x, y):
    return lambdify([x, y], expression)

x, y = symbols('x y')
expr = ((x - y)**(25)).expand()
print(expr)

binary_callable = wraping(expr)
print(binary_callable(1, 2))

lamb = lamFunc(expr, x, y)
print(lamb(1, 2))
which outputs:
x**25 - 25*x**24*y + 300*x**23*y**2 - 2300*x**22*y**3 + 12650*x**21*y**4 - 53130*x**20*y**5 + 177100*x**19*y**6 - 480700*x**18*y**7 + 1081575*x**17*y**8 - 2042975*x**16*y**9 + 3268760*x**15*y**10 - 4457400*x**14*y**11 + 5200300*x**13*y**12 - 5200300*x**12*y**13 + 4457400*x**11*y**14 - 3268760*x**10*y**15 + 2042975*x**9*y**16 - 1081575*x**8*y**17 + 480700*x**7*y**18 - 177100*x**6*y**19 + 53130*x**5*y**20 - 12650*x**4*y**21 + 2300*x**3*y**22 - 300*x**2*y**23 + 25*x*y**24 - y**25
-1.0
-1
If I time the execution, the autowrapped function is 10x faster (depending on the problem, I have also observed cases where the factor was as little as two):
%timeit binary_callable(12, 21)
100000 loops, best of 3: 2.87 µs per loop
%timeit lamb(12, 21)
100000 loops, best of 3: 28.7 µs per loop
So here wraping(expr) wraps your expression expr and returns a wrapped object, binary_callable, which you can use at any time for numerical evaluation.
EDIT: I have done this on Linux/Ubuntu and Windows OS, both seem to work fine!
