How to simulate an approximation of pi with Numba?

I would like to approximate the value of pi using a Monte Carlo method with different numbers of input points (10, 10**1 and so on) and make the code faster with Numba.
Here is a simulation with inputs = 10:
import numba
from random import *
import math
from math import sqrt
@numba.jit(nopython=True)
def go_fast(pi):
    inside = 0
    inputs =10
    for i in range(0,inputs):
        x=random()
        y=random()
        if sqrt(x*x+y*y)<=1:
            inside+=1
            pi=4*inside/inputs
            return (pi)
pi = math.pi
go_fast(pi)
I would just like to make sure that the simulation is set up correctly, since the result I get from it seems a bit misleading. Thanks.

This function
def go_fast(pi):
    inside = 0
    inputs =10
    for i in range(0,inputs):
        x=random()
        y=random()
        if sqrt(x*x+y*y)<=1:
            inside+=1
            pi=4*inside/inputs
            return (pi)
would return as soon as the first x, y satisfying sqrt(x*x+y*y)<=1 is encountered. In other words, the number of iterations of your for loop is not deterministic (it can be any value between 1 and 10), and because inside is 1 at that moment the result is always 4*1/10 = 0.4. If you want a constant number of iterations, you need to put the return outside the body of the loop. The same applies to the pi calculation, which should be done after all the data has been collected. That is, your code should be
def go_fast(pi):
    inside = 0
    inputs =10
    for i in range(0,inputs):
        x=random()
        y=random()
        if sqrt(x*x+y*y)<=1:
            inside+=1
    pi=4*inside/inputs
    return pi
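A minimal sketch of the corrected simulation, generalized so that the number of points is an argument instead of a hard-coded 10. The names estimate_pi, n_points and seed are illustrative and do not come from the question. The test x*x + y*y <= 1 is equivalent to the sqrt version and skips the square root. The sketch is plain Python so that it runs without Numba installed; Numba documents support for random.random() in nopython mode, so decorating a loop like this with numba.njit should be possible, perhaps with small changes.

```python
import math
import random

def estimate_pi(n_points, seed=None):
    """Estimate pi from n_points uniform random points in the unit square."""
    if seed is not None:
        random.seed(seed)  # a fixed seed makes a run repeatable
    inside = 0
    for _ in range(n_points):
        x = random.random()
        y = random.random()
        # Points with x^2 + y^2 <= 1 fall inside the quarter circle.
        if x * x + y * y <= 1.0:
            inside += 1
    # The ratio is computed once, after all points have been counted.
    return 4.0 * inside / n_points

# Try increasing sample sizes: 10, 100, ..., 100000.
for exponent in range(1, 6):
    n = 10 ** exponent
    estimate = estimate_pi(n, seed=0)
    print(n, estimate, abs(estimate - math.pi))
```

With only 10 points the estimate can only be a multiple of 0.4, so a value far from 3.14 is expected there; the error shrinks roughly like 1/sqrt(n_points).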

Related

Is it possible to solve this problem in parallel for several parameter values in Python?

Below is the code for my task. In this case e0=15, but I would like to solve the problem for a set of values of the parameter e0 (e0 = 7, 10, 15, 20, 28). I have a multi-core processor and would like to distribute the calculation for each value of e0 to a separate core.
How can I do parallel calculations for this task in Python?
import sympy as sp
import scipy as sc
import numpy as np
e0=15
einf=15
def Psi(r,n):
    return 2*np.exp(-r/n)*np.sqrt(sc.special.factorial(n)/sc.special.factorial(-1+n))*sc.special.hyp1f1(1-n, 2, 2*r/n)/n**2
def PsiSymb(n):
    r=sp.symbols('r')
    y1=2*sp.exp(-r/n)*np.sqrt(sc.special.factorial(n)/sc.special.factorial(-1+n))/n**2
    y2 = sp.simplify(sp.functions.special.hyper.hyper([1-n], [2], 2*r/n))
    y=y1*y2
    return y
def LaplacianPsi(n):
    r = sp.symbols('r')
    ydiff = 2/r*PsiSymb(n).diff(r)+PsiSymb(n).diff(r,2)
    ydiffnum = sp.lambdify(r, ydiff, "numpy")
    return ydiffnum
def k(n1,n2):
    yint=sc.integrate.quad(lambda r: -0.5*Psi(r,n2)*LaplacianPsi(n1)(r)*r**2,0,np.inf)
    return yint[0]
def p(n1,n2):
    potC=sc.integrate.quad(lambda r: Psi(r,n2)*(-1/r)*Psi(r,n1)*(r**2),0,np.inf)
    potB1=sc.integrate.quad(lambda r: Psi(r,n2)*(1/einf-1/e0)*((einf/e0)**(3/5))*(-e0/(2*r))*(np.exp(-r*2.23))*Psi(r,n1)*(r**2),0,np.inf)
    potB2=sc.integrate.quad(lambda r: Psi(r,n2)*(1/einf-1/e0)*((einf/e0)**(3/5))*(-e0/(2*r))*(np.exp(-r*2.4))*Psi(r,n1)*(r**2),0,np.inf)
    pot=potC[0]+potB1[0]+potB2[0]
    return pot
def en(n1,n2):
    return k(n1,n2)+p(n1,n2)
nmax=3
EnM = [[0]*nmax for i in range(nmax)]
for n1 in range(nmax):
    for n2 in range(nmax):
        EnM[n2][n1]=en(n1+1,n2+1)
EnEig=sc.linalg.eigvalsh(EnM)
EnB=min(EnEig)
print(EnB)

It is not necessary to use multiple cores for this computation. Indeed, the bottleneck is the LaplacianPsi function, which recomputes the same thing over and over. You can use memoization to fix this. Here is an example:
import functools
@functools.cache
def LaplacianPsi(n):
    r = sp.symbols('r')
    ydiff = 2/r*PsiSymb(n).diff(r)+PsiSymb(n).diff(r,2)
    ydiffnum = sp.lambdify(r, ydiff, "numpy")
    return ydiffnum
# The rest is the same
The code can be further optimized, since sc.special.factorial(n) / sc.special.factorial(-1+n) is actually just n, and np.sqrt is inefficient on scalars, so the whole expression should be replaced with math.sqrt(n). The resulting code takes only 0.057 seconds, as opposed to 16.5 seconds for the initial implementation on my machine. This means the new implementation is about 290 times faster while producing the same result!
Using many cores directly would just have wasted more resources for a slower result. You can still try to use more cores with the faster implementation, though it might not be significantly faster.
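To illustrate why memoization helps so much in this pattern, here is a small self-contained sketch. It does not use SymPy or SciPy: build_function is a hypothetical stand-in for an expensive builder such as LaplacianPsi(n), and integrand plays the role of the lambda that the integrator calls once per sample point. The cache statistics show that the expensive step runs once per distinct n rather than once per evaluation.

```python
import functools

@functools.cache  # available from Python 3.9; lru_cache(maxsize=None) is equivalent
def build_function(n):
    """Stand-in for an expensive builder; returns a cheap callable."""
    return lambda r: n * r * r

def integrand(r, n):
    # Called once per sample point, like the lambda passed to an integrator.
    return build_function(n)(r)

total = sum(integrand(r, n) for n in (1, 2, 3) for r in range(1000))

info = build_function.cache_info()
print(info.misses)  # → 3, one real build per distinct n
print(info.hits)    # → 2997, every other call is served from the cache
```

Without the decorator, the builder would run 3000 times here, which mirrors what happens when a symbolic derivative is rebuilt on every integrand evaluation.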

is the overflow error a result of bad formatting?

Function I tried to replicate:
I'm doing a project for coursework in which I need to implement the blackbody function and manipulate it in some ways.
I'm trying out alternate equations, and with two of them I keep getting an overflow error.
This is the error message:
alt2_2 = (1/((const_e**(freq/temp))-1))
OverflowError: (34, 'Result too large')
temp is given in kelvin (I'm using 5800 as my test value, as it is approximately the temperature of the sun).
freq is the speed of light divided by whatever wavelength is input:
freq = (3*(10**8))/wavelength
In this case I am using 0.00000005 as the test value for wavelength,
and const_e is 2.7182.
This is my first time using Stack and my first time doing a project on my own; any help is appreciated.

This does the blackbody computation with your values.
import math
# Planck constant
h = 6.6e-34
# Boltzmann constant
k = 1.38e-23
# Speed of light
c = 3e+8
# Wavelength
wl = 0.00000005
# Temp
T = 5800
# Frequency
f = c/wl
# This is the exponent for e (about 49).
k1 = h*f / (k*T)
# This computes the spectral radiance.
Bvh = 2*f*f*f*h / (math.exp(k1)-1)
print(Bvh)
Output:
9.293819741690355e-08
Since we only used one or two digits on the way in, the resulting value is only good to one or two digits, 9.3E-08.
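As a hedged illustration of where the overflow comes from: with the question's values, freq/temp alone is about 1e12, and e raised to that power is far outside the float range, whereas the dimensionless exponent h*f/(k*T) is only about 50. The sketch below compares the two. The helper names planck_exponent and bose_factor, and the cutoff of 700, are choices made for this example, not part of the original answer.

```python
import math

H = 6.626e-34    # Planck constant, J*s
K_B = 1.381e-23  # Boltzmann constant, J/K
C = 3.0e8        # speed of light, m/s

def planck_exponent(wavelength_m, temperature_k):
    """Dimensionless exponent h*f/(k*T), with f = c / wavelength."""
    return H * C / (wavelength_m * K_B * temperature_k)

def bose_factor(x):
    """Return 1 / (exp(x) - 1) for x > 0 without raising OverflowError."""
    if x > 700.0:
        # exp(x) would exceed the largest float (about 1.8e308);
        # the true result is smaller than 1e-300, so 0.0 is returned.
        return 0.0
    return 1.0 / math.expm1(x)

wavelength = 0.00000005
temp = 5800

x_without_constants = (C / wavelength) / temp      # freq/temp as in the question
x_with_constants = planck_exponent(wavelength, temp)

print(x_without_constants, bose_factor(x_without_constants))
print(x_with_constants, bose_factor(x_with_constants))
```

So the error is not a formatting problem: the exponent is missing the factor h/k, which makes it about ten orders of magnitude too large.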

Numba Python - how to exploit parallelism effectively?

I have been trying to exploit Numba to speed up large array calculations. I have been measuring the calculation speed in GFLOPS, and it consistently falls far short of my expectations for my CPU.
My processor is i9-9900k, which according to float32 benchmarks should be capable of over 200 GFLOPS. In my tests I have never exceeded about 50 GFLOPS. This is running on all 8 cores.
On a single core I achieve about 17 GFLOPS, which (I believe) is 50% of the theoretical performance. I'm not sure if this is improvable, but the fact that it doesn't extend well to multi-core is a problem.
I am trying to learn this because I am planning to write some image processing code that desperately needs every speed boost possible. I also feel I should understand this first, before I dip my toes into GPU computing.
Here is some example code with a few of my attempts at writing fast functions. The operation I am testing, is multiplying an array by a float32 then summing the whole array, i.e. a MAC operation.
How can I get better results?
import os
# os.environ["NUMBA_ENABLE_AVX"] = "1"
import numpy as np
import timeit
from timeit import default_timer as timer
import numba
# numba.config.NUMBA_ENABLE_AVX = 1
# numba.config.LOOP_VECTORIZE = 1
# numba.config.DUMP_ASSEMBLY = 1
from numba import float32, float64
from numba import jit, njit, prange
from numba import vectorize
from numba import cuda
lengthY = 16 # 2D array Y axis
lengthX = 2**16 # X axis
totalops = lengthY * lengthX * 2 # MAC operation has 2 operations
iters = 100
doParallel = True
@njit(fastmath=True, parallel=doParallel)
def MAC_numpy(testarray):
    output = (float)(0.0)
    multconst = (float)(.99)
    output = np.sum(np.multiply(testarray, multconst))
    return output
@njit(fastmath=True, parallel=doParallel)
def MAC_01(testarray):
    lengthX = testarray.shape[1]
    lengthY = testarray.shape[0]
    output = (float)(0.0)
    multconst = (float)(.99)
    for y in prange(lengthY):
        for x in prange(lengthX):
            output += multconst*testarray[y,x]
    return output
@njit(fastmath=True, parallel=doParallel)
def MAC_04(testarray):
    lengthX = testarray.shape[1]
    lengthY = testarray.shape[0]
    output = (float)(0.0)
    multconst = (float)(.99)
    for y in prange(lengthY):
        for x in prange(int(lengthX/4)):
            xn = x*4
            output += multconst*testarray[y,xn] + multconst*testarray[y,xn+1] + multconst*testarray[y,xn+2] + multconst*testarray[y,xn+3]
    return output
# ======================================= TESTS =======================================
testarray = np.random.rand(lengthY, lengthX)
# ==== MAC_numpy ====
time = 1000
for n in range(iters):
    start = timer()
    output = MAC_numpy(testarray)
    end = timer()
    if((end-start) < time): #get shortest time
        time = end-start
print("\nMAC_numpy")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
# ==== MAC_01 ====
time = 1000
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
for n in range(iters):
    start = timer()
    output = MAC_01(testarray)
    end = timer()
    if((end-start) < time): #get shortest time
        time = end-start
print("\nMAC_01")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
# ==== MAC_04 ====
time = 1000
for n in range(iters):
    start = timer()
    output = MAC_04(testarray)
    end = timer()
    if((end-start) < time): #get shortest time
        time = end-start
print("\nMAC_04")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))

Q : How can I get better results?
1st : Learn how to avoid doing useless work - you can eliminate HALF of the FLOPs outright, not to mention half of all the RAM I/O, each writeback costing about +100~350 [ns].
Due to the distributive property of MUL over ADD, ( a.C + b.C ) == ( a + b ).C, it is better to compute np.sum( A ) first and only then multiply the sum by the (float) constant.
# output = np.sum(np.multiply(testarray, multconst)) # AWFULLY INEFFICIENT
output = np.sum( testarray)*multconst #######################
2nd : Learn how to best align data along the order of processing. Cache-line reuse gets you ~100x faster access to pre-fetched data. Vectorised code that is not aligned with the already pre-fetched data pays the RAM-access latency many times over, instead of re-using data blocks that have already been paid for. Designing work-units according to this principle means a few more SLOCs, but the reward is worth it: a speedup of up to ~100x, for free, just from not writing badly or naively designed loop iterators.
3rd : Learn how to efficiently harness vectorised (block-directed) operations inside numpy or numba code blocks, and avoid forcing numba to spend time on auto-analysing call signatures. You pay extra time for this auto-analysis per call, although you designed the code and know exactly which data types will be passed, so why pay for auto-analysis each time a numba block gets called?
4th : Learn where the extended Amdahl's Law, with all the relevant add-on costs and processing atomicity taken into account, supports your wish to get speedups, so that you never pay more than you get back (to at least justify the add-on costs). Paying extra costs without getting any reward is possible, but it has no beneficial impact on your code's performance (rather the opposite).
5th : Learn when and how manually created inline(s) may save your code, once steps 1-4 are well learnt and routinely exercised with proper craftsmanship. Using popular COTS frameworks is fine, yet these may deliver results after a few days of work, while a hand-crafted, single-purpose, smartly designed assembly code was able to get the same results in about 12 minutes(!) without any GPU/CPU tricks, just by not doing a single step more than what was needed for the numerical processing of the large matrix data.
Did I mention float32 may surprise by being processed slower than float64 on small scales, while on larger data scales ~ n [GB] the RAM I/O times grow more slowly thanks to more efficient float32 pre-fetches? This never happens here, as a float64 array gets processed here, unless one explicitly instructs the constructor(s) to downconvert the default data type, like this: np.random.rand( lengthY, lengthX ).astype( dtype = np.float32 )
>>> np.random.rand( 10, 2 ).dtype
dtype('float64')
Avoiding extensive memory allocations is another performance trick, supported in numpy call signatures. Using this option for large arrays will save you a lot of extra time wasted on mem-allocs for large interim arrays. Reusing already pre-allocated memory zones and wisely controlled gc-policing are further signs of a professional focused on low latency and design-for-performance.
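The first point can be sketched without NumPy or Numba. The two functions below are illustrative (the names mac_naive and mac_factored are not from the answer): one multiplies every element before adding, the other sums first and multiplies once, which is the list-based counterpart of np.sum(testarray) * multconst. The results agree up to floating-point rounding, since the two forms add and round in a different order.

```python
import math
import random

def mac_naive(values, const):
    """Multiply each element, then add: one MUL and one ADD per element."""
    total = 0.0
    for v in values:
        total += const * v
    return total

def mac_factored(values, const):
    """Add all elements, then multiply once: one ADD per element, one MUL in total."""
    return const * sum(values)

random.seed(0)
data = [random.random() for _ in range(10000)]
multconst = 0.99

a = mac_naive(data, multconst)
b = mac_factored(data, multconst)
print(a, b, math.isclose(a, b, rel_tol=1e-9))
```

Note that a benchmark which counts two operations per element would report the factored version at a lower FLOP rate for the same wall time, because it performs roughly half the arithmetic.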

High-speed alternatives to replace byte array processing bottlenecks

>> See EDIT below <<
I am working on processing data from a special pixelated CCD camera over serial, using FTDI D2xx drivers via pyUSB.
The camera can operate at high bandwidth to the PC, up to 80 frames/sec. I would love that speed, but I know that it isn't feasible with Python, since it is a scripted language. Still, I would like to know how close I can get, whether through optimizations that I missed in my code, threading, or some other approach. My first thought is to break out the most time-consuming loops and put them in C code, but I don't have much experience with C and am not sure of the best way to get Python to interact with it inline, if that's possible. I have complex algorithms heavily developed in Python with SciPy/Numpy, which are already optimized and have acceptable performance, so I would need a way to just speed up the acquisition of the data to feed back to Python, if that's the best approach.
The difficulty, and the reason I used Python, and not some other language, is due to the need to be able to easily run it cross-platform (I develop in Windows, but am putting the code on an embedded Linux board, making a stand-alone system). If you suggest that I use another code, like C, how would I be able to work cross-platform? I have never worked with compiling a lower-level language like C between Windows and Linux, so I would want to be sure of that process - I would have to compile it for each system, right? What do you suggest?
Here are my functions, with current execution times:
ReadStream: 'RXcount' is 114733 for a device read, formatting from string to byte equivalent
Returns a list of bytes (0-255), representing binary values
Current execution time: 0.037 sec
def ReadStream(RXcount):
    global ftdi
    RXdata = ftdi.read(RXcount)
    RXdata = list(struct.unpack(str(len(RXdata)) + 'B', RXdata))
    return RXdata
ProcessRawData: To reshape the byte list into an array that matches the pixel orientations
Results in a 3584x32 array, after trimming off some un-needed bytes.
Data is unique in that every block of 14 rows represents 14 bits of one row of pixels on the device (with 32 bytes across @ 8 bits/byte = 256 bits across), which is 256x256 pixels. The processed array has 32 columns of bytes because each byte, in binary, represents 8 pixels (32 bytes * 8 bits = 256 pixels). Still working on how to do that one... I have already posted a question about that previously.
Current execution time: 0.01 sec ... not bad, it's just Numpy
def ProcessRawData(RawData):
    if len(RawData) == 114733:
        ProcessedMatrix = np.ndarray((1, 114733), dtype=int)
        np.copyto(ProcessedMatrix, RawData)
        ProcessedMatrix = ProcessedMatrix[:, 1:-44]
        ProcessedMatrix = np.reshape(ProcessedMatrix, (-1, 32))
        return ProcessedMatrix
    else:
        return None
Finally,
GetFrame: The device has a mode where it just outputs whether a pixel detected anything or not, using the lowest bit of the array (every 14th row) - Get that data and convert to int for each pixel
Results in 256x256 array, after processing every 14th row, which are bytes to be read as binary (32 bytes across ... 32 bytes * 8 bits = 256 pixels across)
Current execution time: 0.04 sec
def GetFrame(ProcessedMatrix):
    if np.shape(ProcessedMatrix) == (3584, 32):
        FrameArray = np.zeros((256, 256), dtype='B')
        DataRows = ProcessedMatrix[13::14]
        for i in range(256):
            RowData = ""
            for j in range(32):
                RowData = RowData + "{:08b}".format(DataRows[i, j])
            FrameArray[i] = [int(RowData[b:b+1], 2) for b in range(256)]
        return FrameArray
    else:
        return False
Goal:
I would like to target a total execution time of ~0.02 secs/frame by whatever suggestions you make (currently it's 0.25 secs/frame with the GetFrame function being the weakest). The device I/O is not the limiting factor, as that outputs a data packet every 0.0125 secs. If I get the execution time down, then can I just run the acquisition and processing in parallel with some threading?
Let me know what you suggest as the best path forward - Thank you for the help!
EDIT, thanks to @Jaime:
Functions are now:
def ReadStream(RXcount):
    global ftdi
    return np.frombuffer(ftdi.read(RXcount), dtype=np.uint8)
... time 0.013 sec
def ProcessRawData(RawData):
    if len(RawData) == 114733:
        return RawData[1:-44].reshape(-1, 32)
    return None
... time 0.000007 sec!
def GetFrame(ProcessedMatrix):
    if ProcessedMatrix.shape == (3584, 32):
        return np.unpackbits(ProcessedMatrix[13::14]).reshape(256, 256)
    return False
... time 0.00006 sec!
So, with pure Python, I am now able to acquire the data at the desired frame rate! After a few tweaks to the D2xx USB buffers and latency timing, I just clocked it at 47.6 FPS!
Last step is if there is any way to make this run in parallel with my processing algorithms? Need some way to pass the result of GetFrame to another loop running in parallel.

There are several places where you can speed things up significantly. Perhaps the most obvious is rewriting GetFrame:
def GetFrame(ProcessedMatrix):
    if ProcessedMatrix.shape == (3584, 32):
        return np.unpackbits(ProcessedMatrix[13::14]).reshape(256, 256)
    return False
This requires that ProcessedMatrix be an ndarray of type np.uint8, but other than that, on my systems it runs 1000x faster.
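To make explicit what np.unpackbits computes in this function, here is a pure-Python sketch of the same bit expansion for a single row of bytes. The helper unpack_bits is illustrative only and far slower than the NumPy call. By default np.unpackbits emits the most significant bit of each byte first, which matches the "{:08b}".format(...) strings built in the question's loop.

```python
def unpack_bits(row_bytes):
    """Expand each byte into 8 bits, most significant bit first."""
    return [(byte >> shift) & 1
            for byte in row_bytes
            for shift in range(7, -1, -1)]

row = bytes([0b10000001, 0b11110000])
print(unpack_bits(row))
# → [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

Applied to a 32-byte row this yields the 256 pixel flags of one line of the frame.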
With your other two functions, I think that in ReadStream you should do something like:
def ReadStream(RXcount):
    global ftdi
    return np.frombuffer(ftdi.read(RXcount), dtype=np.uint8)
Even if it doesn't speed up that function much, because it is the reading taking up most of the time, it will already give you a numpy array of bytes to work on. With that, you can then go on to ProcessRawData and try:
def ProcessRawData(RawData):
    if len(RawData) == 114733:
        return RawData[1:-44].reshape(-1, 32)
    return None
Which is 10x faster than your version.
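On the follow-up in the question's EDIT about handing frames to a processing loop that runs in parallel, one common pattern is a producer thread and a consumer thread connected by a queue.Queue. The sketch below is an assumption-laden illustration, not tested against the camera: acquire_frames stands in for the ReadStream / ProcessRawData / GetFrame chain and produces fake 8-byte frames, and process_frames stands in for the analysis code. Threads suit the acquisition side because it mostly waits on device I/O.

```python
import queue
import threading

def acquire_frames(frame_queue, n_frames):
    """Producer: hypothetical stand-in for the acquisition functions."""
    for i in range(n_frames):
        frame = bytes([i % 256]) * 8  # fake frame data for the example
        frame_queue.put(frame)        # blocks if the consumer falls behind
    frame_queue.put(None)             # sentinel: no more frames

def process_frames(frame_queue, results):
    """Consumer: hypothetical stand-in for the processing algorithms."""
    while True:
        frame = frame_queue.get()
        if frame is None:
            break
        results.append(sum(frame))    # placeholder for the real analysis

frame_queue = queue.Queue(maxsize=4)  # a bounded queue limits memory use
results = []
producer = threading.Thread(target=acquire_frames, args=(frame_queue, 5))
consumer = threading.Thread(target=process_frames, args=(frame_queue, results))
producer.start()
consumer.start()
producer.join()
consumer.join()
print(results)  # → [0, 8, 16, 24, 32]
```

If the processing itself is CPU-bound pure Python, a multiprocessing.Queue with separate processes may be the better fit; which one wins depends on how much of the work releases the GIL.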

The fourth root of (12) or any other number in Python 3

I'm trying to write simple code for 12 to the power of 4 (12 ** 4).
I have the output number (20736), but I want to turn 20736 back into its original value (12), and I don't know how to do that in Python.
In real mathematics I do that with the fourth-root notation {12؇}.
The question is how to compute {12؇} in Python.
I'm using sqrt(), but sqrt only works for power 2.
#!/usr/bin/env python3.3
import math
def pwo():
    f=12 ** 4 #f =20736 #
    c= # should c = 12 #
    return f,c
print(pwo())

def f(num):
    return num**0.25
or
import math
def f(num):
    return math.sqrt(math.sqrt(num))
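Both versions are special cases of raising to the power 1/n. A hedged sketch of a general helper follows, plugged into the question's pwo function; the name nth_root is made up for this example, and it handles only non-negative real inputs.

```python
import math

def nth_root(value, n):
    """Real n-th root of a non-negative number via a fractional exponent."""
    if value < 0:
        raise ValueError("nth_root expects a non-negative value")
    return value ** (1.0 / n)

def pwo():
    f = 12 ** 4          # 20736
    c = nth_root(f, 4)   # back to (approximately) 12
    return f, c

f, c = pwo()
print(f, c)
print(math.isclose(c, 12.0))  # → True
```

The result is a float, so it may differ from 12 in the last digits for some inputs; use round(c) when an integer root is expected.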

For scientific purposes (where you need a high level of precision), you can use numpy:
>>> def root(n, r=4):
...     from numpy import roots
...     return roots([1]+[0]*(r-1)+[-n])
...
>>> print(root(12))
[ -1.86120972e+00+0.j -3.05311332e-16+1.86120972j
 -3.05311332e-16-1.86120972j 1.86120972e+00+0.j ]
>>>
The output may look strange, but it can be used just as you would use a list. Furthermore, the above function will allow you to find any root of any number (I put the default for r equal to 4 since you asked for the fourth root specifically). Finally, numpy is a good choice because it will return the complex numbers that will also satisfy your equations.
