Comparing R and Python Vectorization and Optimization - python

In the R language, optimization can be achieved with the purrr::map() or furrr::future_map() functions. However, I am not sure how optimization works for np.array() methods. In particular, I would like to understand how Python and R scale out to parallel processing [1, 2] in terms of complexity and performance.
Thus, the following questions arise:
How does the optimization of np.array() in Python work compared to the purrr::map() and furrr::future_map() functions in the R language?
A simple tictoc-style timing test shows a big win from vectorization in both languages. Nonetheless, the results also seem to show that the R language is simply faster here.
Python
import time
import numpy as np

a = np.random.rand(1000000)
b = np.random.rand(1000000)

tic = time.time()
c = np.dot(a, b)
toc = time.time()
print("Vectorized version: " + str(1000*(toc-tic)) + "ms")

c = 0
tic = time.time()
for i in range(1000000):
    c += a[i]*b[i]
toc = time.time()
print("For loop: " + str(1000*(toc-tic)) + "ms")
Output
Vectorized version: 54.151296615600586ms
For loop: 676.0082244873047ms
R
a <- runif(1000000,0,1)
b <- runif(1000000,0,1)
c = 0
tictoc::tic()
c = sum(a * b)
tictoc::toc()
c = 0
tictoc::tic()
for (i in 1:length(a)) {
  c = a[i]*b[i] + c
}
tictoc::toc()
Output
Vectorized version: 0.013 sec elapsed
For loop: 0.065 sec elapsed
References
[1] Ihaka, R. & Gentleman, R. (1996). "R: A Language for Data Analysis and Graphics," Journal of Computational and Graphical Statistics, 5(3), 299-314. doi:10.1080/10618600.1996.10474713
[2] van der Walt, S., Colbert, S. C. & Varoquaux, G. (2011). "The NumPy Array: A Structure for Efficient Numerical Computation," Computing in Science & Engineering, 13(2), 22-30. doi:10.1109/MCSE.2011.37

I believe NumPy wraps some of its "primitive" objects in wrapper classes which are, themselves, Python (e.g. this one).
Looking at the R mirror source, I conversely find an array class that is essentially native code (i.e. C).
That extra indirection layer alone could explain the difference in speed, I guess.
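A small illustration of that indirection (my own sketch, not from the answer above): indexing a NumPy array element by element hands back a boxed Python object (numpy.float64), so the scalar for-loop above pays interpreter and boxing overhead on every iteration, while np.dot stays inside compiled C/BLAS code for the whole array.

import numpy as np

a = np.random.rand(3)
print(type(a[0]))   # <class 'numpy.float64'> - a full Python object per element access
print(type(a))      # <class 'numpy.ndarray'> - the underlying buffer itself lives in C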

Related

Numba Python - how to exploit parallelism effectively?

I have been trying to exploit Numba to speed up large array calculations. I have been measuring the calculation speed in GFLOPS, and it consistently falls far short of my expectations for my CPU.
My processor is an i9-9900K, which according to float32 benchmarks should be capable of over 200 GFLOPS. In my tests I have never exceeded about 50 GFLOPS, and that is running on all 8 cores.
On a single core I achieve about 17 GFLOPS, which (I believe) is about 50% of the theoretical peak. I'm not sure whether that can be improved, but the fact that it doesn't scale well to multiple cores is a problem.
I am trying to learn this because I am planning to write some image processing code that desperately needs every speed boost possible. I also feel I should understand this first, before I dip my toes into GPU computing.
Here is some example code with a few of my attempts at writing fast functions. The operation I am testing is multiplying an array by a float32 and then summing the whole array, i.e. a MAC (multiply-accumulate) operation.
How can I get better results?
import os
# os.environ["NUMBA_ENABLE_AVX"] = "1"
import numpy as np
import timeit
from timeit import default_timer as timer
import numba
# numba.config.NUMBA_ENABLE_AVX = 1
# numba.config.LOOP_VECTORIZE = 1
# numba.config.DUMP_ASSEMBLY = 1
from numba import float32, float64
from numba import jit, njit, prange
from numba import vectorize
from numba import cuda

lengthY = 16                        # 2D array Y axis
lengthX = 2**16                     # X axis
totalops = lengthY * lengthX * 2    # a MAC counts as 2 operations
iters = 100
doParallel = True

@njit(fastmath=True, parallel=doParallel)
def MAC_numpy(testarray):
    output = (float)(0.0)
    multconst = (float)(.99)
    output = np.sum(np.multiply(testarray, multconst))
    return output

@njit(fastmath=True, parallel=doParallel)
def MAC_01(testarray):
    lengthX = testarray.shape[1]
    lengthY = testarray.shape[0]
    output = (float)(0.0)
    multconst = (float)(.99)
    for y in prange(lengthY):
        for x in prange(lengthX):
            output += multconst*testarray[y, x]
    return output

@njit(fastmath=True, parallel=doParallel)
def MAC_04(testarray):
    lengthX = testarray.shape[1]
    lengthY = testarray.shape[0]
    output = (float)(0.0)
    multconst = (float)(.99)
    for y in prange(lengthY):
        for x in prange(int(lengthX/4)):
            xn = x*4
            output += multconst*testarray[y, xn] + multconst*testarray[y, xn+1] + multconst*testarray[y, xn+2] + multconst*testarray[y, xn+3]
    return output
# ======================================= TESTS =======================================
testarray = np.random.rand(lengthY, lengthX)

# ==== MAC_numpy ====
time = 1000
for n in range(iters):
    start = timer()
    output = MAC_numpy(testarray)
    end = timer()
    if (end-start) < time:    # keep the shortest time
        time = end-start
print("\nMAC_numpy")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))

# ==== MAC_01 ====
time = 1000
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
for n in range(iters):
    start = timer()
    output = MAC_01(testarray)
    end = timer()
    if (end-start) < time:    # keep the shortest time
        time = end-start
print("\nMAC_01")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))

# ==== MAC_04 ====
time = 1000
for n in range(iters):
    start = timer()
    output = MAC_04(testarray)
    end = timer()
    if (end-start) < time:    # keep the shortest time
        time = end-start
print("\nMAC_04")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
Q : How can I get better results?
1st: Learn how to avoid doing useless work. You can outright eliminate half of the FLOPs, not to mention also avoiding half of all the RAM I/O, each write-back of which costs roughly +100~350 [ns].
Because multiplication distributes over addition, ( a*C + b*C ) == ( a + b )*C, it is better to first compute np.sum( A ) and only then multiply the sum by the (float) constant.
# output = np.sum(np.multiply(testarray, multconst))   # awfully inefficient
output = np.sum(testarray) * multconst
2nd: Learn how to align data with the order of processing. Cache-line reuse gives roughly 100x faster access to already pre-fetched data; code that is not aligned with those pre-fetched blocks pays the RAM-access latency many times over instead of smartly reusing data it has already paid for. Designing work-units around this principle costs a few more SLOCs, but the reward is worth it: nobody gets ~100x faster CPUs and RAM for free, yet a ~100x speedup is available simply by not writing badly or naively designed loop iterators.
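A small sketch of that effect (my own, not from the answer): with a C-ordered NumPy array, the same reduction is far cheaper when the inner traversal follows the contiguous row direction than when it strides across columns, purely because of cache-line reuse.

import numpy as np
import time

a = np.random.rand(10000, 10000)      # C-order: each row is contiguous in memory

t0 = time.time()
s = 0.0
for i in range(a.shape[0]):
    s += a[i, :].sum()                # walks memory sequentially, reuses fetched cache lines
row_time = time.time() - t0

t0 = time.time()
s = 0.0
for j in range(a.shape[1]):
    s += a[:, j].sum()                # jumps 80 kB (one full row) between consecutive elements
col_time = time.time() - t0

print("row-order: %.3fs   column-order: %.3fs" % (row_time, col_time))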
3rd: Learn how to efficiently harness vectorised (block-directed) operations inside numpy or numba code blocks, and avoid forcing numba to spend time auto-analysing the call signatures. You pay that extra analysis cost on each call, even though you designed the code and know exactly which data types will be passed in, so why pay for the auto-analysis every time a numba block gets called?
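As a hedged sketch of that point (the function name and types are mine, not from the question): giving numba an explicit signature makes it compile eagerly, once, at definition time, instead of working the argument types out lazily at call time.

from numba import njit, prange, float64

@njit(float64(float64[:, :]), fastmath=True, parallel=True)
def MAC_01_typed(testarray):
    # same logic as MAC_01, but the call signature is declared up front
    output = 0.0
    multconst = 0.99
    for y in prange(testarray.shape[0]):
        for x in range(testarray.shape[1]):
            output += multconst * testarray[y, x]
    return output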
4th: Learn where the extended Amdahl's Law, with all the relevant add-on costs and the atomicity of processing taken into account, actually supports your wish for a speedup, so that you never pay more than you get back (if only to justify the add-on costs). Paying extra costs without getting any reward is entirely possible, yet it has no beneficial impact on your code's performance (rather the opposite).
5th: Learn when and how manually created inlines can save your code, once steps 1-4 are well learnt and routinely exercised with proper craftsmanship. Using popular COTS frameworks is fine, yet they may deliver results only after a few days of work, whereas hand-crafted, single-purpose, smartly designed assembly code has been able to get the same results in about 12 minutes(!) rather than several days, without any GPU/CPU tricks. Yes, that much faster, simply by not doing a single step more than what the numerical processing of the large matrix data actually needed.
Did I mention that float32 can, surprisingly, be processed more slowly than float64 at small scales, whereas at larger data scales (~ n [GB]) the RAM I/O times grow more slowly thanks to the more efficient float32 pre-fetches? That never happens here, because a float64 array is what gets processed: np.random.rand() returns float64 by default, unless one explicitly instructs the constructor to down-convert the data type, like this:

np.random.rand( lengthY, lengthX ).astype( dtype = np.float32 )

>>> np.random.rand( 10, 2 ).dtype
dtype('float64')

Avoiding extensive memory allocations is another performance trick, supported in numpy call signatures. Using that option for large arrays will save you a lot of extra time otherwise wasted on mem-allocs for large interim arrays. Reusing already pre-allocated memory zones and wisely controlled gc-policing are further marks of a professional focused on low latency and design-for-performance.
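A hedged illustration of both of those points (variable names are mine, not from the question): down-converting explicitly to float32 and reusing one pre-allocated scratch buffer through numpy's out= parameter, so the elementwise multiply never allocates a fresh interim array.

import numpy as np

lengthY, lengthX = 16, 2**16
testarray = np.random.rand(lengthY, lengthX).astype(np.float32)   # rand() is float64 by default
scratch = np.empty_like(testarray)                                # allocated once, reused every call

def mac_preallocated(a, const, out):
    np.multiply(a, const, out=out)   # writes into the existing buffer, no new allocation
    return out.sum()

result = mac_preallocated(testarray, np.float32(0.99), scratch)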

What is the fastest way to sort strings in Python if locale is a non-concern?

I was trying to find a fast way to sort strings in Python where locale is a non-concern, i.e. I just want to sort the array lexically according to the underlying bytes. This is a perfect fit for something like radix sort. Here is my MWE:
import numpy as np
import timeit

# randChar is a workaround for MemoryError in mtrand.RandomState.choice
# http://stackoverflow.com/questions/25627161/how-to-solve-memory-error-in-mtrand-randomstate-choice
def randChar(f, numGrp, N):
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

N = int(1e7)
K = 100
id3 = randChar("id%010d", N//K, N)   # small groups (char)
timeit.Timer("id3.sort()", "from __main__ import id3").timeit(1)   # 6.8 seconds
As you can see, it took 6.8 seconds, which is almost 10x slower than R's radix sort below.
N = 1e7
K = 100
id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE)
system.time(sort(id3,method="radix"))
I understand that Python's .sort() doesn't use radix sort; is there an implementation somewhere that lets me sort strings as performantly as R does?
AFAIK both R and Python "intern" strings so any optimisations in R can also be done in Python.
The top Google result for "radix sort strings python" is this gist, which produced an error when sorting my test array.
It is true that R interns all strings, meaning it has a "global character cache" which serves as a central dictionary of all strings ever used by your program. This has its advantages: the data takes less memory, and certain algorithms (such as radix sort) can take advantage of this structure to achieve higher speed. This is particularly true for scenarios such as the one in your example, where the number of unique strings is small relative to the size of the vector. On the other hand, it has its drawbacks too: the global character cache prevents multi-threaded write access to character data.
In Python, AFAIK, only string literals are interned. For example:
>>> 'abc' is 'abc'
True
>>> x = 'ab'
>>> (x + 'c') is 'abc'
False
In practice it means that, unless you've embedded data directly into the text of the program, nothing will be interned.
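A small sketch (mine, not from the answer) showing that CPython will intern a runtime-built string only if you ask it to, via sys.intern():

import sys

x = 'ab'
y = x + 'c'                      # built at runtime, so CPython does not intern it automatically
z = sys.intern(y)                # request the canonical, interned copy
print(z is sys.intern('abc'))    # True: both names now refer to the same object
print(y is sys.intern('abc'))    # typically False: y is still the separate, un-interned copy

Interning by itself will not make list.sort() faster here, though; it mainly makes equality checks and dictionary lookups cheaper.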
Now, for your original question: "what is the fastest way to sort strings in Python?" You can achieve very good speeds, comparable with R, with the Python datatable package. Here's a benchmark that sorts N = 10⁸ strings, randomly selected from a set of 1024:
import datatable as dt
import pandas as pd
import random
from time import time
n = 10**8
src = ["%x" % random.getrandbits(10) for _ in range(n)]
f0 = dt.Frame(src)
p0 = pd.DataFrame(src)
f0.to_csv("test1e8.csv")
t0 = time(); f1 = f0.sort(0); print("datatable: %.3fs" % (time()-t0))
t0 = time(); src.sort(); print("list.sort: %.3fs" % (time()-t0))
t0 = time(); p1 = p0.sort_values(0); print("pandas: %.3fs" % (time()-t0))
Which produces:
datatable: 1.465s / 1.462s / 1.460s (multiple runs)
list.sort: 44.352s
pandas: 395.083s
The same dataset in R (v3.4.2):
> require(data.table)
> DT = fread("test1e8.csv")
> system.time(sort(DT$C1, method="radix"))
user system elapsed
6.238 0.585 6.832
> system.time(DT[order(C1)])
user system elapsed
4.275 0.457 4.738
> system.time(setkey(DT, C1)) # sort in-place
user system elapsed
3.020 0.577 3.600
Jeremy Mets posted in the comments of this blog post that NumPy can sort strings fairly quickly if the list is converted to an np.array. This does indeed improve performance, though it is still slower than Julia's implementation.
import numpy as np
import timeit

# randChar is a workaround for MemoryError in mtrand.RandomState.choice
# http://stackoverflow.com/questions/25627161/how-to-solve-memory-error-in-mtrand-randomstate-choice
def randChar(f, numGrp, N):
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

N = int(1e7)
K = 100
id3 = np.array(randChar("id%010d", N//K, N))   # small groups (char)
timeit.Timer("id3.sort()", "from __main__ import id3").timeit(1)

Load large computation sequence from file and evaluate matrix in Python

My problem involves large matrices (20GB+ in storage) where each matrix element consists of an algebraic expression. To bypass this size issue I wrote a script which converts the matrix into a computation sequence, and by doing so more than halves the file size. Here is an example of how this is done:
Consider the arithmetic expression (a 1x1 matrix):
Running this through the sequencing code produces:
Where the definitions for the tx parameters are:
t1 = A^2, t2 = t1*A, t3 = t1^2, t4 = t3*t2, t7 = t3*t1, t8 = B^2, t14 = C^2, t16 = t3*A, t17 = t8*B, t23 = t8^2, t33 = A+B, t34 = t33^2, t35 = t34^2
For this isolated example it may seem pointless, but when applied to a 10,000 x 10,000 matrix the number of common sequences between elements reduces the storage size substantially (like a zipping procedure).
My question is how best to process these definitions, which are saved in files, with Python in order to rebuild the matrix and evaluate its elements.
For the small (1x1) example above this can be done easily:
from __future__ import division
# Values for A,B,C
A = 1
B = 2
C = 3
# List of definitions
t1 = A**2
t2 = t1*A
t3 = t1**2
t4 = t3*t2
t7 = t3*t1
t8 = B**2
t14 = C**2
t16 = t3*A
t17 = t8*B
t23 = t8**2
t33 = A+B
t34 = t33**2
t35 = t34**2
# Print numerical result
print((1/2)*(6*B*C*t7+16*C*t16*t8+t1*t23*t8+B*t4+C*t4+t14*t7+31*t16*t17+6*t7*t8+16)/(t17*t14*C*t2*t35*t33))
This gives the correct answer of 0.00565843621399. For matrices with larger lists of definitions I have imported the file as a module, which works well, but once the file gets larger (1 GB+) the import runs into memory issues while creating the .pyc file.
I can read the file line by line, but this makes the evaluation of the matrix more complicated, as the tx definitions are now all strings.
I feel there are multiple ways to approach this problem, but I am unsure of the most efficient implementation for very large matrices, so I am asking more experienced people for insights on how best to proceed.
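One possible direction (a minimal sketch under my own assumptions: a hypothetical definitions.txt with one tN = ... assignment per line, in dependency order, and a final result = ... line) is to evaluate the file line by line into a single namespace dictionary with exec(). Nothing is imported as a module, so no .pyc file is ever built, and only the numeric values of the tN terms are kept in memory.

# Values for A, B, C go into the shared namespace first
namespace = {"A": 1.0, "B": 2.0, "C": 3.0}

with open("definitions.txt") as fh:
    for line in fh:
        line = line.strip()
        if not line:
            continue
        # each assignment is compiled and executed into the same dict,
        # so later definitions can refer to earlier tN values
        exec(compile(line, "<definitions>", "exec"), namespace)

print(namespace["result"])   # assumes the file ends with: result = <expression>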

Basic multi GPU parallelization of matrix multiplication

I want to parallelize the following simple expression on 2 GPUs: C = A^n + B^n, by calculating A^n on GPU 0 and B^n on GPU 1 before summing the results.
In TensorFlow I would do something like:
with tf.device('/gpu:0'):
    An = matpow(A, n)
with tf.device('/gpu:1'):
    Bn = matpow(B, n)
with tf.Session() as sess:
    C = sess.run(An + Bn)
However, since PyTorch is dynamic, I'm having trouble doing the same thing. I tried the following but it only takes more time.
with torch.cuda.device(0):
    A = A.cuda()
with torch.cuda.device(1):
    B = B.cuda()
C = matpow(A, n) + matpow(B, n).cuda(0)
I know there is a module to parallelize models on the batch dimension using torch.nn.DataParallel, but here I am trying to do something more basic.
You can use CUDA streams for this. That will not necessarily distribute the work over two devices, but the execution will be in parallel.
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

with torch.cuda.stream(s1):
    A = torch.pow(A, n)
with torch.cuda.stream(s2):
    B = torch.pow(B, n)
C = A + B
I'm not sure whether it will really speed up your computation if you only parallelize this one operation, though; your matrices would have to be really big.
If your requirement is to split it across devices, you can add this before the streams:
A = A.cuda(0)
B = B.cuda(1)
Then after the power operation, you need to get them on the same device again, e.g. B = B.cuda(0). After that you can do the addition.
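Putting both pieces together, a minimal sketch of the split-across-devices variant (assuming two visible GPUs and a naive matpow() helper standing in for the one from the question):

import torch

def matpow(M, n):
    # naive repeated matrix multiply, just a stand-in for the question's matpow
    R = M
    for _ in range(n - 1):
        R = R @ M
    return R

A = torch.randn(4096, 4096, device="cuda:0")
B = torch.randn(4096, 4096, device="cuda:1")

An = matpow(A, 3)          # launched on GPU 0
Bn = matpow(B, 3)          # launched on GPU 1; CUDA calls are asynchronous,
                           # so the two computations can overlap in time
C = An + Bn.to("cuda:0")   # copy Bn back to GPU 0, then add

Whether this actually helps depends on the matrices being large enough that the kernel time dominates the cost of copying Bn back to GPU 0.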

Programming language with multiple roots

The expression 2^(-1/3) has three roots:
0.79370, -0.39685-0.68736i and -0.39685+0.68736i (approximately)
See the correct answer at Wolfram Alpha.
I know several languages that support complex numbers, but they all return only the first (principal) of the three results:
Python:
>>> complex(2,0)**(-1/3)
(0.7937005259840998-0j)
Octave:
>> (2+0i)^(-1/3)
ans = 0.79370
Julia:
julia> complex(2,0)^(-1/3)
0.7937005259840998 + 0.0im
What I'm looking for is something along the lines of:
>> 2^(-1/3)
[0.79370+0i, -0.39685-0.68736i, -0.39685+0.68736i]
Is there a programming language (with a REPL) that will correctly return all three roots, without having to resort to any special modules or libraries, that also has an open source implementation available?
As many comments explained, wanting a general purpose language to give by default the result from every branch of the complex root function is probably a tall order. But Julia allows specializing/overloading operators very naturally (as even the out-of-the-box implementation is often written in Julia). Specifically:
using Roots, Polynomials   # Might need to Pkg.add("Roots") first
import Base: ^
^{T<:AbstractFloat}(b::T, r::Rational{Int64}) =
    roots(poly([0])^r.den - b^abs(r.num)).^sign(r.num)
And now when trying to raise a float to a rational power:
julia> 2.0^(-1//3)
3-element Array{Complex{Float64},1}:
-0.39685-0.687365im
-0.39685+0.687365im
0.793701-0.0im
Note that specializing the definition of ^ to rational exponents solves the rounding problem mentioned in the comments.
Here is how to solve for all of the roots of b^(1/n) via the roots of the polynomial x^n - b, using Matlab's roots or Octave's roots:
b = 2;
n = -3; % for b^(1/n)
c = [1 zeros(1,abs(n)-1) -b];
r = roots(c).^sign(n);
which returns
r =
-0.396850262992050 - 0.687364818499301i
-0.396850262992050 + 0.687364818499301i
0.793700525984100 + 0.000000000000000i
Alternatively, using roots of unity (not sure how numerically robust this is):
b = 2;
n = -3;
n0 = abs(n);
r0 = b^(1/n0);
w = exp(2*pi*1i/n0);
r = (r0*w.^(0:n0-1).').^sign(n)
Or using Matlab's Symbolic Math toolbox:
b = 2;
n = -3;
c = [1 zeros(1,abs(n)-1) -b];
r = solve(poly2sym(c)).^sign(n)
which returns:
r =
2^(2/3)/2
2^(2/3)/(2*((3^(1/2)*1i)/2 - 1/2))
-2^(2/3)/(2*((3^(1/2)*1i)/2 + 1/2))
In certain cases you might also find nthroot helpful (Octave documentation).
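For completeness, the same polynomial-roots trick carries over to Python with NumPy (a sketch of my own, not part of the original answers; the function name is made up):

import numpy as np

def all_roots(b, num, den):
    # all complex values of b**(num/den), e.g. all_roots(2, -1, 3) for 2**(-1/3)
    coeffs = [1] + [0] * (den - 1) + [-b ** abs(num)]   # coefficients of x**den - b**|num|
    r = np.roots(coeffs)
    return r if num > 0 else 1 / r                      # a negative exponent means reciprocals

print(all_roots(2, -1, 3))
# approximately [0.79370+0j, -0.39685-0.68736j, -0.39685+0.68736j] (order may vary)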
