Python 3.X: Why numexpr.evaluate() slower than eval()? - python

The purpose of using the numexpr.evaluate() is to speed up the compute. But im my case it wokrs even slower than numpy und eval(). I would like to know why?
code as example:
import datetime
import numpy as np
import numexpr as ne
expr = '11808000.0*1j*x**2*exp(2.5e-10*1j*x) + 1512000.0*1j*x**2*exp(5.0e-10*1j*x)'
# use eval
start_eval = datetime.datetime.now()
namespace = dict(x=np.array([m+3j for m in range(1, 1001)]), exp=np.exp)
result_eval = eval(expr, namespace)
end_eval = datetime.datetime.now()
# print(result)
print("time by using eval : %s" % (end_eval- start_eval))
# use numexpr
# ne.set_num_threads(8)
start_ne = datetime.datetime.now()
x = np.array([n+3j for n in range(1, 1001)])
result_ne = ne.evaluate(expr)
end_ne = datetime.datetime.now()
# print(result_ne)
print("time by using numexpr: %s" % (end_ne- start_ne))
return:
time by using eval : 0:00:00.002998
time by using numexpr: __ 0:00:00.052969

Thank you all
I have get the answer from robbmcleod in Github
for NumExpr 2.6 you'll need arrays around 128 - 256 kElements to see a speed-up. NumPy always starts faster as it doesn't have to synchronize at a thread barrier and otherwise spin-up the virtual machine
Also, once you call numexpr.evaluate() the second time, it should be faster as it will have already compiled everything. Compilation takes around 0.5 ms for simple expressions, more for longer expressions. The expression is stored as a hash, so the next time that computational expense is gone.
related url: https://github.com/pydata/numexpr/issues/301#issuecomment-388097698

Related

Numba Python - how to exploit parallelism effectively?

I have been trying to exploit Numba to speed up large array calculations. I have been measuring the calculation speed in GFLOPS, and it consistently falls far short of my expectations for my CPU.
My processor is i9-9900k, which according to float32 benchmarks should be capable of over 200 GFLOPS. In my tests I have never exceeded about 50 GFLOPS. This is running on all 8 cores.
On a single core I achieve about 17 GFLOPS, which (I believe) is 50% of the theoretical performance. I'm not sure if this is improvable, but the fact that it doesn't extend well to multi-core is a problem.
I am trying to learn this because I am planning to write some image processing code that desperately needs every speed boost possible. I also feel I should understand this first, before I dip my toes into GPU computing.
Here is some example code with a few of my attempts at writing fast functions. The operation I am testing, is multiplying an array by a float32 then summing the whole array, i.e. a MAC operation.
How can I get better results?
import os
# os.environ["NUMBA_ENABLE_AVX"] = "1"
import numpy as np
import timeit
from timeit import default_timer as timer
import numba
# numba.config.NUMBA_ENABLE_AVX = 1
# numba.config.LOOP_VECTORIZE = 1
# numba.config.DUMP_ASSEMBLY = 1
from numba import float32, float64
from numba import jit, njit, prange
from numba import vectorize
from numba import cuda
lengthY = 16 # 2D array Y axis
lengthX = 2**16 # X axis
totalops = lengthY * lengthX * 2 # MAC operation has 2 operations
iters = 100
doParallel = True
#njit(fastmath=True, parallel=doParallel)
def MAC_numpy(testarray):
output = (float)(0.0)
multconst = (float)(.99)
output = np.sum(np.multiply(testarray, multconst))
return output
#njit(fastmath=True, parallel=doParallel)
def MAC_01(testarray):
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
output = (float)(0.0)
multconst = (float)(.99)
for y in prange(lengthY):
for x in prange(lengthX):
output += multconst*testarray[y,x]
return output
#njit(fastmath=True, parallel=doParallel)
def MAC_04(testarray):
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
output = (float)(0.0)
multconst = (float)(.99)
for y in prange(lengthY):
for x in prange(int(lengthX/4)):
xn = x*4
output += multconst*testarray[y,xn] + multconst*testarray[y,xn+1] + multconst*testarray[y,xn+2] + multconst*testarray[y,xn+3]
return output
# ======================================= TESTS =======================================
testarray = np.random.rand(lengthY, lengthX)
# ==== MAC_numpy ====
time = 1000
for n in range(iters):
start = timer()
output = MAC_numpy(testarray)
end = timer()
if((end-start) < time): #get shortest time
time = end-start
print("\nMAC_numpy")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
# ==== MAC_01 ====
time = 1000
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
for n in range(iters):
start = timer()
output = MAC_01(testarray)
end = timer()
if((end-start) < time): #get shortest time
time = end-start
print("\nMAC_01")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
# ==== MAC_04 ====
time = 1000
for n in range(iters):
start = timer()
output = MAC_04(testarray)
end = timer()
if((end-start) < time): #get shortest time
time = end-start
print("\nMAC_04")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
Q : How can I get better results?
1st : Learn how to avoid doing useless work - you can straight eliminate HALF of the FLOP-s not speaking about also the half of all the RAM-I/O-s avoided, each one being at a cost of +100~350 [ns] per writeback
Due to the distributive nature of MUL and ADD ( a.C + b.C ) == ( a + b ).C, better first np.sum( A ) and only after that then MUL the sum by the (float) constant.
#utput = np.sum(np.multiply(testarray, multconst)) # AWFULLY INEFFICIENT
output = np.sum( testarray)*multconst #######################
2nd : Learn how to best align data along the order of processing ( cache-line reuses get you ~100x faster re-use of pre-fetched data. Not aligning vectorised-code along these already pre-fetched data side-effects just let your code pay many times the RAM-access latencies, instead of smart re-using the already paid for data-blocks. Designing work-units aligned according to this principle means a few SLOCs more, but the rewards are worth that - who gets ~100x faster CPUs+RAMs for free and right now or about a ~100x speedup for free, just from not writing a badly or naively designed looping iterators?
3rd : Learn how to efficiently harness vectorised (block-directed) operations inside numpy or numba code-blocks and avoid pressing numba to spend time on auto-analysing the call-signatures ( you pay an extra time for this auto-analyses per call, while you have designed the code and knew exactly what data-types are going to go there, so why to pay an extra time for auto-analysis each time a numba-block gets called???)
4th : Learn where the extended Amdahl's Law, having all the relevant add-on costs and processing atomicity put into the game, supports your wish to get speedups, not to ever pay way more than you will get back (to at least justify the add-on costs... ) - paying extra costs for not getting any reward is possible, yet has no beneficial impact on your code's performance ( rather the opposite )
5th : Learn when and how the manually created inline(s) may save your code, once the steps 1-4 are well learnt and routinely excersised with proper craftmanship ( Using popular COTS frameworks is fine, yet these may deliver results after a few days of work, while a hand-crafted single purpose smart designed assembly code was able to get the same results in about 12 minutes(!), not several days without any GPU/CPU tricks etc - yes, that faster - just by not doing a single step more than what was needed for the numerical processing of the large matrix data )
Did I mention float32 may surprise at being processed slower on small scales than float64, while on larger data-scales ~ n [GB] the RAM I/O-times grow slower for more efficient float32 pre-fetches? This never happens here, as float64 array gets processed here. Sure, unless one explicitly instructs the constructor(s) to downconvert the default data type, like this: np.random.rand( lengthY, lengthX ).astype( dtype = np.float32 )>>> np.random.rand( 10, 2 ).dtypedtype('float64')Avoiding extensive memory allocations is another performance trick, supported in numpy call-signatures. Using this option for large arrays will save you a lot of extra time wasted on mem-allocs for large interim arrays. Reusing already pre-allocated memory-zones and wisely controlled gc-policing are another signs of a professional, focused on low-latency & design-for-performance

Cython Optimization of Numpy for Loop

I am new to cython and have the following code for a numpy for loop that I am trying to optimize. So far, this Cython code isn't much faster than the numpy for loop.
# cython: infer_types = True
import numpy as np
cimport numpy
DTYPE = np.double
def hdcfTransfomation(scanData):
cdef Py_ssize_t Position
scanLength = scanData.shape[0]
hdcfFunction_np = np.zeros(scanLength, dtype = DTYPE)
cdef double [::1] hdcfFunction = hdcfFunction_np
for position in range(scanLength - 1):
topShift = scanData[1 + position:]
bottomShift = scanData[:-(position + 1)]
arrayDiff = np.subtract(topShift, bottomShift)
arraySquared = np.square(arrayDiff)
arrayMean = np.mean(arraySquared, axis = 0)
hdcfFunction[position] = arrayMean
return hdcfFunction
I know that using C math library functions would be more ideal than calling back into the numpy language (subtract, square, mean), but I am not sure where I can find a list of functions that can be called in this manner.
I have been trying to figure out ways to optimize this code by using different types, ect. but nothing is providing the performance that I think is possible with a fully optimized implementation of Cython.
Here is a working example of the numpy for-loop:
def hdcfTransfomation(scanData):
scanLength = scanData.shape[0]
hdcfFunction = np.zeros(scanLength)
for position in range(scanLength - 1):
topShift = scanData[1 + position:]
bottomShift = scanData[:-(position + 1)]
arrayDiff = np.subtract(topShift, bottomShift)
arraySquared = np.square(arrayDiff)
arrayMean = np.mean(arraySquared, axis = 0)
hdcfFunction[position] = arrayMean
return hdcfFunction
scanDataArray = np.random.rand(80000, 1)
transformedScan = hdcfTransformed(scanDataArray)
Always provide as much informations as possible (some example data, Python/Cython Version, Compiler Version/Settings and CPU Model.
Without that it is quite hard to compare any timings. For example this problem benefits quite well from SIMD-vectorization. It will make quite a difference which compiler you use or if you want to redistribute a compiled version which should also run on low-end or quite old CPUS (eg. no AVX).
I am not very familiar with Cython, but I think your main problem is the missing declaration for scanData. Maybe the C-Compiler needs additional flags like march=native, but the real syntax is compiler dependend. I am am also not sure how Cython or the C-compiler optimizes this part:
arrayDiff = np.subtract(topShift, bottomShift)
arraySquared = np.square(arrayDiff)
arrayMean = np.mean(arraySquared, axis = 0)
If that loops (all vectorized commands are actually loops) are not joined, but intead there are temporary arryas like in pure Python created, this will slow down the code. It will be a good idea to create a 1D array first. (eg. scanData=scanData[::1]
As said I am not that familliar with Cython, I tried what is possible with Numba. At least it shows what should also be possible with a resonable good Cython implementation.
Maybe easier to otimize for the compiler
import numba as nb
import numpy as np
#nb.njit(fastmath=True,error_model='numpy',parallel=True)
#scanData is a 1D-array here
def hdcfTransfomation(scanData):
scanLength = scanData.shape[0]
hdcfFunction = np.zeros(scanLength, dtype = scanData.dtype)
for position in nb.prange(scanLength - 1):
topShift = scanData[1 + position:]
bottomShift = scanData[:scanData.shape[0]-(position + 1)]
sum=0.
jj=0
for i in range(scanLength-(position + 1)):
jj+=1
sum+=(topShift[i]-bottomShift[i])**2
hdcfFunction[position] = sum/jj
return hdcfFunction
I also used parallelization here, because the problem is embarrassingly parallel. At least with a size of 80_000 and Numba it doesn't matter if you use a slightly modified version of your code (1D-array), or the code above.
Timings
#Quadcore Core i7-4th gen,Numba 0.4dev,Python 3.6
scanData=np.random.rand(80_000)
#The first call to the function isn't measured (compilation overhead),but the following calls.
Pure Python: 5900ms
Numba single-threaded: 947ms
Numba parallel: 260ms
Especially for larger arrays than np.random.rand(80_000) there may be better aproaches (loop tilling for better cache usage), but for this size that should be more or less OK (At least it fits in the L3-cache)
Naive GPU Implementation
from numba import cuda, float32
#cuda.jit('void(float32[:], float32[:])')
def hdcfTransfomation_gpu(scanData,out_data):
scanLength = scanData.shape[0]
position = cuda.grid(1)
if position < scanLength - 1:
sum= float32(0.)
offset=1 + position
for i in range(scanLength-offset):
sum+=(scanData[i+offset]-scanData[i])**2
out_data[position] = sum/(scanLength-offset)
hdcfTransfomation_gpu[scanData.shape[0]//64,64](scanData,res_3)
This gives about 400ms on a GT640 (float32) and 970ms (float64). For a good implemenation shared arrays should be considered.
Putting cython aside, does this do the same thing as your current code but without a for loop? We can tighten it up and correct for inaccuracies, but the first port of call is to try apply operations in numpy to 2D arrays before turning to cython for for loops. It's too long to put in a comment.
import numpy as np
# Setup
arr = np.random.choice(np.arange(10), 100).reshape(10, 10)
top_shift = arr[:, :-1]
bottom_shift = arr[:, 1:]
arr_diff = top_shift - bottom_shift
arr_squared = np.square(arr_diff)
arr_mean = arr_squared.mean(axis=1)

What is the fastest way to sort strings in Python if locale is a non-concern?

I was trying to find a fast way to sort strings in Python and the locale is a non-concern i.e. I just want to sort the array lexically according to the underlying bytes. This is perfect for something like radix sort. Here is my MWE
import numpy as np
import timeit
# randChar is workaround for MemoryError in mtrand.RandomState.choice
# http://stackoverflow.com/questions/25627161/how-to-solve-memory-error-in-mtrand-randomstate-choice
def randChar(f, numGrp, N) :
things = [f%x for x in range(numGrp)]
return [things[x] for x in np.random.choice(numGrp, N)]
N=int(1e7)
K=100
id3 = randChar("id%010d", N//K, N) # small groups (char)
timeit.Timer("id3.sort()" ,"from __main__ import id3").timeit(1) # 6.8 seconds
As you can see it took 6.8 seconds which is almost 10x slower than R's radix sort below.
N = 1e7
K = 100
id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE)
system.time(sort(id3,method="radix"))
I understand that Python's .sort() doesn't use radix sort, is there an implementation somewhere that allows me to sort strings as performantly as R?
AFAIK both R and Python "intern" strings so any optimisations in R can also be done in Python.
The top google result for "radix sort strings python" is this gist which produced an error when sorting on my test array.
It is true that R interns all strings, meaning it has a "global character cache" which serves as a central dictionary of all strings ever used by your program. This has its advantages: the data takes less memory, and certain algorithms (such as radix sort) can take advantage of this structure to achieve higher speed. This is particularly true for the scenarios such as in your example, where the number of unique strings is small relative to the size of the vector. On the other hand it has its drawbacks too: the global character cache prevents multi-threaded write access to character data.
In Python, afaik, only string literals are interned. For example:
>>> 'abc' is 'abc'
True
>>> x = 'ab'
>>> (x + 'c') is 'abc'
False
In practice it means that, unless you've embedded data directly into the text of the program, nothing will be interned.
Now, for your original question: "what is the fastest way to sort strings in python"? You can achieve very good speeds, comparable with R, with python datatable package. Here's the benchmark that sorts N = 10⁸ strings, randomly selected from a set of 1024:
import datatable as dt
import pandas as pd
import random
from time import time
n = 10**8
src = ["%x" % random.getrandbits(10) for _ in range(n)]
f0 = dt.Frame(src)
p0 = pd.DataFrame(src)
f0.to_csv("test1e8.csv")
t0 = time(); f1 = f0.sort(0); print("datatable: %.3fs" % (time()-t0))
t0 = time(); src.sort(); print("list.sort: %.3fs" % (time()-t0))
t0 = time(); p1 = p0.sort_values(0); print("pandas: %.3fs" % (time()-t0))
Which produces:
datatable: 1.465s / 1.462s / 1.460s (multiple runs)
list.sort: 44.352s
pandas: 395.083s
The same dataset in R (v3.4.2):
> require(data.table)
> DT = fread("test1e8.csv")
> system.time(sort(DT$C1, method="radix"))
user system elapsed
6.238 0.585 6.832
> system.time(DT[order(C1)])
user system elapsed
4.275 0.457 4.738
> system.time(setkey(DT, C1)) # sort in-place
user system elapsed
3.020 0.577 3.600
Jeremy Mets posted in the comments of this blog post that Numpy can sort string fairly by converting the array to np.araray. This indeed improve performance, however it is still slower than Julia's implementation.
import numpy as np
import timeit
# randChar is workaround for MemoryError in mtrand.RandomState.choice
# http://stackoverflow.com/questions/25627161/how-to-solve-memory-error-in-mtrand-randomstate-choice
def randChar(f, numGrp, N) :
things = [f%x for x in range(numGrp)]
return [things[x] for x in np.random.choice(numGrp, N)]
N=int(1e7)
K=100
id3 = np.array(randChar("id%010d", N//K, N)) # small groups (char)
timeit.Timer("id3.sort()" ,"from __main__ import id3").timeit(1) # 6.8 seconds

How to use sympy codegen with expressions that contain implemented functions

Im trying to compile an expression that contains an UndefinedFunction which has an implementation provided. (Alternatively: an expression which contains a Symbol which represents a call to an external numerical function)
Is there a way to do this? Either with autowrap or codegen or perhaps with some manual editing of the generated files?
The following naive example does not work:
import sympy as sp
import numpy as np
from sympy.abc import *
from sympy.utilities.lambdify import implemented_function
from sympy.utilities.autowrap import autowrap, ufuncify
def g_implementation(a):
"""represents some numerical function"""
return a*5
# sympy wrapper around the above function
g = implemented_function('g', g_implementation)
# some random expression using the above function
e = (x+g(a))**2+100
# try to compile said expression
f = autowrap(e, backend='cython')
# fails with "undefined reference to `g'"
EDIT:
I have several large Sympy expressions
The expressions are generated automatically (via differentiation and such)
The expressions contain some "implemented UndefinedFunctions" that call some numerical functions (i.e. NOT Sympy expressions)
The final script/program that has to evaluate the expressions for some input will be called quite often. That means that evaluating the expression in Sympy (via evalf) is definitely not feasible. Even compiling just in time (lambdify, autowrap, ufuncify, numba.jit) produces too much of an overhead.
Basically I want to create a binary python extension for those expressions without having to implement them by hand in C, which I consider too error prone.
OS is Windows 7 64bit
You may want to take a look at this answer about serialization of SymPy lambdas (generated by lambdify).
It's not exactly what you asked but may alleviate your problem with startup performance. Then lambdify-ied functions will mostly count only execution time.
You may also take a look at Theano. It has nice integration with SymPy.
Ok, this could do the trick, I hope, if not let me know and I'll try again.
I compare a compiled version of an expression using Cython against the lambdified expression.
from sympy.utilities.autowrap import autowrap
from sympy import symbols, lambdify
def wraping(expression):
return autowrap(expression, backend='cython')
def lamFunc(expression, x, y):
return lambdify([x,y], expr)
x, y = symbols('x y')
expr = ((x - y)**(25)).expand()
print expr
binary_callable = wraping(expr)
print binary_callable(1, 2)
lamb = lamFunc(expr, x, y)
print lamb(1,2)
which outputs:
x**25 - 25*x**24*y + 300*x**23*y**2 - 2300*x**22*y**3 + 12650*x**21*y**4 - 53130*x**20*y**5 + 177100*x**19*y**6 - 480700*x**18*y**7 + 1081575*x**17*y**8 - 2042975*x**16*y**9 + 3268760*x**15*y**10 - 4457400*x**14*y**11 + 5200300*x**13*y**12 - 5200300*x**12*y**13 + 4457400*x**11*y**14 - 3268760*x**10*y**15 + 2042975*x**9*y**16 - 1081575*x**8*y**17 + 480700*x**7*y**18 - 177100*x**6*y**19 + 53130*x**5*y**20 - 12650*x**4*y**21 + 2300*x**3*y**22 - 300*x**2*y**23 + 25*x*y**24 - y**25
-1.0
-1
If I time the execution times the autowraped function is 10x faster (depending on the problem I have also observed cases where the factor was as little as two):
%timeit binary_callable(12, 21)
100000 loops, best of 3: 2.87 µs per loop
%timeit lamb(12, 21)
100000 loops, best of 3: 28.7 µs per loop
So here wraping(expr) wraps your expression expr and returns a wrapped object binary_callable. This you can use at any time to do numerical evaluation.
EDIT: I have done this on Linux/Ubuntu and Windows OS, both seem to work fine!

Parallelizing a Numpy vector operation

Let's use, for example, numpy.sin()
The following code will return the value of the sine for each value of the array a:
import numpy
a = numpy.arange( 1000000 )
result = numpy.sin( a )
But my machine has 32 cores, so I'd like to make use of them. (The overhead might not be worthwhile for something like numpy.sin() but the function I actually want to use is quite a bit more complicated, and I will be working with a huge amount of data.)
Is this the best (read: smartest or fastest) method:
from multiprocessing import Pool
if __name__ == '__main__':
pool = Pool()
result = pool.map( numpy.sin, a )
or is there a better way to do this?
There is a better way: numexpr
Slightly reworded from their main page:
It's a multi-threaded VM written in C that analyzes expressions, rewrites them more efficiently, and compiles them on the fly into code that gets near optimal parallel performance for both memory and cpu bounded operations.
For example, in my 4 core machine, evaluating a sine is just slightly less than 4 times faster than numpy.
In [1]: import numpy as np
In [2]: import numexpr as ne
In [3]: a = np.arange(1000000)
In [4]: timeit ne.evaluate('sin(a)')
100 loops, best of 3: 15.6 ms per loop
In [5]: timeit np.sin(a)
10 loops, best of 3: 54 ms per loop
Documentation, including supported functions here. You'll have to check or give us more information to see if your more complicated function can be evaluated by numexpr.
Well this is kind of interesting note if you run the following commands:
import numpy
from multiprocessing import Pool
a = numpy.arange(1000000)
pool = Pool(processes = 5)
result = pool.map(numpy.sin, a)
UnpicklingError: NEWOBJ class argument has NULL tp_new
wasn't expecting that, so whats going on, well:
>>> help(numpy.sin)
Help on ufunc object:
sin = class ufunc(__builtin__.object)
| Functions that operate element by element on whole arrays.
|
| To see the documentation for a specific ufunc, use np.info(). For
| example, np.info(np.sin). Because ufuncs are written in C
| (for speed) and linked into Python with NumPy's ufunc facility,
| Python's help() function finds this page whenever help() is called
| on a ufunc.
yep numpy.sin is implemented in c as such you can't really use it directly with multiprocessing.
so we have to wrap it with another function
perf:
import time
import numpy
from multiprocessing import Pool
def numpy_sin(value):
return numpy.sin(value)
a = numpy.arange(1000000)
pool = Pool(processes = 5)
start = time.time()
result = numpy.sin(a)
end = time.time()
print 'Singled threaded %f' % (end - start)
start = time.time()
result = pool.map(numpy_sin, a)
pool.close()
pool.join()
end = time.time()
print 'Multithreaded %f' % (end - start)
$ python perf.py
Singled threaded 0.032201
Multithreaded 10.550432
wow, wasn't expecting that either, well theres a couple of issues for starters we are using a python function even if its just a wrapper vs a pure c function, and theres also the overhead of copying the values, multiprocessing by default doesn't share data, as such each value needs to be copy back/forth.
do note that if properly segment our data:
import time
import numpy
from multiprocessing import Pool
def numpy_sin(value):
return numpy.sin(value)
a = [numpy.arange(100000) for _ in xrange(10)]
pool = Pool(processes = 5)
start = time.time()
result = numpy.sin(a)
end = time.time()
print 'Singled threaded %f' % (end - start)
start = time.time()
result = pool.map(numpy_sin, a)
pool.close()
pool.join()
end = time.time()
print 'Multithreaded %f' % (end - start)
$ python perf.py
Singled threaded 0.150192
Multithreaded 0.055083
So what can we take from this, multiprocessing is great but we should always test and compare it sometimes its faster and sometimes its slower, depending how its used ...
Granted you are not using numpy.sin but another function I would recommend you first verify that indeed multiprocessing will speed up the computation, maybe the overhead of copying values back/forth may affect you.
Either way I also do believe that using pool.map is the best, safest method of multithreading code ...
I hope this helps.
SciPy actually has a pretty good writeup on this subject here.

Categories