Efficient double for loop over large matrices - python

I have the following code, which I need to run more than once. Currently it takes too long. Is there a more efficient way to write these two for loops?
ErrorEst=[]
for i in range(len(embedingFea)):#17000
    temp=[]
    for j in range(len(emedingEnt)):#15000
        if cooccurrenceCount[i][j]>0:
            #print(coaccuranceCount[i][j]/ count_max)
            weighting_factor = np.min(
                [1.0,
                 math.pow(np.float32(cooccurrenceCount[i][j]/ count_max), scaling_factor)])
            embedding_product = (np.multiply(emedingEnt[j], embedingFea[i]), 1)
            #tf.log(tf.to_float(self.__cooccurrence_count))
            log_cooccurrences =np.log (np.float32(cooccurrenceCount[i][j]))
            distance_expr = np.square(([
                embedding_product+
                focal_bias[i],
                context_bias[j],
                -(log_cooccurrences)]))
            single_losses =(weighting_factor* distance_expr)
            temp.append(single_losses)
    ErrorEst.append(np.sum(temp))

You can use Numba or Cython.
First, avoid lists wherever possible and write simple, readable code with explicit loops, as you would in C. All inputs and outputs should be NumPy arrays or scalars.
Your Code
import numpy as np
import numba as nb
import math
def your_func(embedingFea,emedingEnt,cooccurrenceCount,count_max,scaling_factor,focal_bias,context_bias):
    ErrorEst=[]
    for i in range(len(embedingFea)):#17000
        temp=[]
        for j in range(len(emedingEnt)):#15000
            if cooccurrenceCount[i][j]>0:
                weighting_factor = np.min([1.0,math.pow(np.float32(cooccurrenceCount[i][j]/ count_max), scaling_factor)])
                embedding_product = (np.multiply(emedingEnt[j], embedingFea[i]), 1)
                log_cooccurrences =np.log (np.float32(cooccurrenceCount[i][j]))
                distance_expr = np.square(([embedding_product+focal_bias[i],context_bias[j],-(log_cooccurrences)]))
                single_losses =(weighting_factor* distance_expr)
                temp.append(single_losses)
        ErrorEst.append(np.sum(temp))
    return ErrorEst
Numba Code
@nb.njit(fastmath=True,error_model="numpy",parallel=True)
def your_func_2(embedingFea,emedingEnt,cooccurrenceCount,count_max,scaling_factor,focal_bias,context_bias):
    ErrorEst=np.empty((embedingFea.shape[0],2))
    # the outer loop is parallelized by Numba, one row per iteration
    for i in nb.prange(embedingFea.shape[0]):
        temp_1=0.
        temp_2=0.
        for j in range(emedingEnt.shape[0]):
            if cooccurrenceCount[i,j]>0:
                weighting_factor=(cooccurrenceCount[i,j]/ count_max)**scaling_factor
                if weighting_factor>1.:
                    weighting_factor=1.
                embedding_product = emedingEnt[j]*embedingFea[i]
                log_cooccurrences =np.log(cooccurrenceCount[i,j])
                temp_1+=weighting_factor*(embedding_product+focal_bias[i])**2
                temp_1+=weighting_factor*(context_bias[j])**2
                temp_1+=weighting_factor*(log_cooccurrences)**2
                temp_2+=weighting_factor*(1.+focal_bias[i])**2
                temp_2+=weighting_factor*(context_bias[j])**2
                temp_2+=weighting_factor*(log_cooccurrences)**2
        ErrorEst[i,0]=temp_1
        ErrorEst[i,1]=temp_2
    return ErrorEst
Timings
embedingFea=np.random.rand(1700)+1
emedingEnt=np.random.rand(1500)+1
cooccurrenceCount=np.random.rand(1700,1500)+1
focal_bias=np.random.rand(1700)
context_bias=np.random.rand(1500)
count_max=100
scaling_factor=2.5
%timeit res_1=your_func(embedingFea,emedingEnt,cooccurrenceCount,count_max,scaling_factor,focal_bias,context_bias)
1min 1s ± 346 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=your_func_2(embedingFea,emedingEnt,cooccurrenceCount,count_max,scaling_factor,focal_bias,context_bias)
17.6 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
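If adding Numba is not an option, the same sums can also be expressed with NumPy broadcasting alone. The following is a minimal sketch and not part of the original answer: the name your_func_vectorized is made up for illustration, and it assumes the same 1-D embedding arrays and 2-D count matrix as in the timing setup above. It is meant to reproduce the arithmetic of your_func_2, including skipping entries whose count is not positive.

```python
import numpy as np

def your_func_vectorized(embedingFea, emedingEnt, cooccurrenceCount,
                         count_max, scaling_factor, focal_bias, context_bias):
    # Hypothetical helper: same result layout as your_func_2, shape (n_features, 2).
    counts = np.asarray(cooccurrenceCount, dtype=np.float64)
    mask = counts > 0
    # Replace non-positive counts by 1 so that log and power stay finite;
    # those entries are zeroed out again through the mask below.
    safe = np.where(mask, counts, 1.0)
    weights = np.minimum(1.0, (safe / count_max) ** scaling_factor) * mask
    log_counts = np.log(safe)
    product = np.outer(embedingFea, emedingEnt)  # shape (n, m)
    shared = context_bias[np.newaxis, :] ** 2 + log_counts ** 2
    focal = focal_bias[:, np.newaxis]
    term_1 = weights * ((product + focal) ** 2 + shared)
    term_2 = weights * ((1.0 + focal) ** 2 + shared)
    # Sum over j, one row of two values per i
    return np.stack([term_1.sum(axis=1), term_2.sum(axis=1)], axis=1)
```

Every intermediate here is a full (n, m) float64 array, so at 17000 x 15000 each one takes about 2 GB. For the full-size problem the Numba loop, or processing blocks of rows at a time, is likely the better fit.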

If you need to increase the performance of your code, you could write it in a low-level language like C and try to avoid floating point numbers where possible.
Possible solution: Can we use C code in Python?

You could try using numba and wrapping your code with the @jit decorator. The first execution usually has to compile the function and will therefore not show much speedup, but subsequent calls will be much faster.
You may need to put your loop in a function for this to work.
from numba import jit
@jit(nopython=True)
def my_double_loop(some, arguments):
    for i in range(len(embedingFea)):#17000
        temp=[]
        for j in range(len(emedingEnt)):#15000
            # ...

Related

iPython timeit - only time part of the operation

I was attempting to determine, via iPython's %%timeit mechanism, whether set.remove is faster than list.remove when a conundrum came up.
I could do
In [1]: %%timeit
a_list = list(range(100))
a_list.remove(50)
and then do the same thing but with a set. However, this would include the overhead from the list/set construction. Is there a way to re-build the list/set each iteration but only time the remove method?
Put your setup code on the same line to create any names or precursor operations you need!
https://ipython.org/ipython-doc/dev/interactive/magics.html#magic-timeit
In cell mode, the statement in the first line is used as setup code (executed but not timed) and the body of the cell is timed. The cell body has access to any variables created in the setup code.
%%timeit setup_code
...
Unfortunately, only a single loop per run can be used (-n1), because the setup code is not re-run before each loop:
%%timeit -n1 x = list(range(100))
x.remove(50)
Surprisingly, this doesn't accept a setup string the way the timeit module does. Combined with the single-loop requirement, I'd still defer to the timeit module with a string setup= and repeat it when a lot of setup or statistically higher precision is needed.
See @Kelly Bundy's much more precise answer for more!
Alternatively, using the timeit module with more repetitions and some statistics:
list: 814 ns ± 3.7 ns
set: 152 ns ± 1.6 ns
list: 815 ns ± 4.3 ns
set: 154 ns ± 1.6 ns
list: 817 ns ± 4.3 ns
set: 153 ns ± 1.6 ns
Code (Try it online!):
from timeit import repeat
from statistics import mean, stdev
for _ in range(3):
    for kind in 'list', 'set':
        ts = repeat('data.remove(50)', f'data = {kind}(range(100))', number=1, repeat=10**5)
        ts = [t * 1e9 for t in sorted(ts)[:1000]]
        print('%4s: %3d ns ± %.1f ns' % (kind, mean(ts), stdev(ts)))

Unable to get expected speed up with cython (search in a list of strings)

I am unable to speed up computations with a cythonized version of the function bulk_phone_finder defined below (it uses the phonenumbers library from PyPI):
from phonenumbers import PhoneNumberMatcher
from phonenumbers.phonenumberutil import SUPPORTED_REGIONS
def bulk_phone_finder(l):
    return [get_phonenumbers(content) for content in l]
def get_phonenumbers(content):
    return set([m.raw_string for cc in SUPPORTED_REGIONS for m in PhoneNumberMatcher(content, cc)])
I am fairly new to Cython, but I managed to write an equivalent Cython function:
# cython: c_string_type=unicode, c_string_encoding=utf8
from phonenumbers import PhoneNumberMatcher
from phonenumbers.phonenumberutil import SUPPORTED_REGIONS
from libcpp.vector cimport vector
from libcpp.string cimport string
from libcpp.set cimport set as c_set
cpdef c_set[string] _SUPPORTED_REGIONS = SUPPORTED_REGIONS
cdef get_phonenumbers(string content):
    res = set()
    cdef string cc
    cdef string raw_string
    for cc in _SUPPORTED_REGIONS :
        for m in PhoneNumberMatcher(content, cc):
            raw_string = m.raw_string
            res.add(raw_string)
    return res
def bulk_phone_finder(l):
    res=[]
    cpdef c_set[string] int_res
    cpdef string content
    for content in l :
        int_res= get_phonenumbers(content)
        res.append(int_res)
    return res
This Cython .pyx file is compiled with the following setup.py:
from distutils.core import setup, Extension
from Cython.Build import cythonize
setup( ext_modules = cythonize(Extension(
    "bulk_phonenumber_finder",
    sources=["bulk_phonenumber_finder.pyx"],
    # extra_compile_args=["-std=c++11"],
    language="c++"
)))
But I do not get any speed-up:
lines=["0256985412"]*100
# cython version
In [6]: %timeit bulk_phonenumber_finder.bulk_phone_finder(lines)
9.39 s ± 363 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# python version
In [8]: %timeit bulk_phone_finder(lines)
9.44 s ± 213 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Do you have any advice to speed up my function bulk_phone_finder ?
Any help is more than welcome ,
Thanks in advance,

Python timeit - TypeError: 'module' object is not callable

I usually use timeit in jupyter notebook like this:
def some_function():
    for x in range(1000):
        return x
timeit(some_func())
and get the result like this:
6.3 ms ± 42.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
but today I got an error like this:
TypeError Traceback (most recent call last)
<ipython-input-11-fef6a46355f1> in <module>
----> 1 timeit(some_func())
TypeError: 'module' object is not callable
How does it occur?
You are currently trying to execute the timeit module, rather than the function contained within.
You should change your import statement from import timeit to from timeit import timeit. Alternatively, you can call the function using timeit.timeit.
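Outside of IPython, the two fixes suggested above look like this. This is a minimal sketch using only the standard library. Note that the function object itself is passed (some_function, not some_function()), so that timeit can call it repeatedly. The alias timeit_function and the loop count of 100000 are arbitrary choices made here so that both variants can live in one script.

```python
import timeit  # binds the module, so timeit(...) would call the module itself

def some_function():
    for x in range(1000):
        return x

# Reproduce the error from the question: a module object is not callable.
try:
    timeit(some_function)
except TypeError as exc:
    error_message = str(exc)

# Fix 1: call the function that lives inside the module.
elapsed_module_style = timeit.timeit(some_function, number=100_000)

# Fix 2: import the function itself (aliased here to keep the module name intact).
from timeit import timeit as timeit_function
elapsed_function_style = timeit_function(some_function, number=100_000)

print(error_message)
print(elapsed_module_style, elapsed_function_style)
```

Both calls return the total time in seconds for all 100000 executions, not a per-loop statistic like the %timeit magic.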
After searching and experimenting for a while, I realized that to use timeit(some_function()) in a Jupyter notebook we do not need import timeit, but the call has to go in a separate input cell, like this:
IN [1]:
def some_function():
    for x in range(1000):
        return x
IN [2]:
timeit(some_func())
and we get output like this:
280 ns ± 2.78 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
When we write everything in one input cell like this:
IN [1]:
def some_function():
    for x in range(1000):
        return x
timeit(some_func())
we get an error saying that timeit is not defined, and if we then import timeit we get the other error shown in the question, TypeError: 'module' object is not callable.
That is because after import timeit we have to call timeit.timeit and specify stmt and, if needed, setup, e.g.:
import timeit
SETUP = """
import yourmodul_here
"""
TEST_CODE = """
def some_function():
    for x in range(1000):
        return x
"""
timeit.timeit(stmt=TEST_CODE, setup=SETUP, number=2000000)
And we get output like this:
0.12415042300017376
stmt is the code to run.
setup is code that needs to run once before TEST_CODE is timed.
number is how many times stmt is executed; the default is 1000000.
So when we import timeit we need to write a bit more.

Get the RGB value of a specific pixel live

How would I be able to get the RGB value of a pixel on my screen live with python? I have tried using
from PIL import ImageGrab as ig
while(True):
    screen = ig.grab()
    g = (screen.getpixel((358, 402)))
    print(g)
to get the value but there is noticeable lag.
Is there another way to do this without capturing the whole screen? I think that is the cause of the lag.
Is there a way to drastically speed up this process?
Is it possible to constrain ig.grab() to the pixel at (358, 402) and get the values from there?
You will probably find it faster to use mss, which is specifically designed to provide high-speed screenshot capabilities in Python, and can be used like so:
import mss
with mss.mss() as sct:
    # left is the x coordinate and top is the y coordinate of the pixel (358, 402)
    pic = sct.grab({'mon':1, 'top':402, 'left':358, 'width':1, 'height':1})
    g = pic.pixel(0,0)
See the mss documentation for more information. The most important thing is to avoid repeatedly entering with mss.mss() as sct; create the object once and reuse it for every grab.
The following change speeds it up by about 15%:
pixel = (358, 402)
pixel_boundary = (pixel + (pixel[0]+1, pixel[1]+1))
g = ig.grab(pixel_boundary)
return g.getpixel((0,0))
Runtime:
proposed: 383 ms ± 5.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
original: 450 ms ± 5.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Interpretation vs dynamic dispatch penalty in Python

I watched Brandon Rhodes' talk about Cython - "The Day of the EXE Is Upon Us".
Brandon mentions at 09:30 that for a specific short piece of code, skipping interpretation gave 40% speedup, while skipping the allocation and dispatch gave 574% speedup (10:10).
My question is - how is this measured for a specific piece of code? Does one need to manually extract the underlying c commands and then somehow make the runtime run them?
This is a very interesting observation, but how do I recreate the experiment?
Let's take a look at this python function:
def py_fun(i,N,step):
    res=0.0
    while i<N:
        res+=i
        i+=step
    return res
and use ipython-magic to time it:
In [11]: %timeit py_fun(0.0,1.0e5,1.0)
10 loops, best of 3: 25.4 ms per loop
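To see what the interpreter actually has to step through, the standard dis module can list the bytecode of this function. This is only an illustrative sketch: the exact opcode names depend on the CPython version, so no particular listing is shown here.

```python
import dis

def py_fun(i, N, step):
    res = 0.0
    while i < N:
        res += i
        i += step
    return res

# Collect the names of the bytecode instructions the interpreter loop executes.
opnames = [instruction.opname for instruction in dis.get_instructions(py_fun)]
print(sorted(set(opnames)))

# dis.dis(py_fun) prints the full, human-readable listing instead.
```

Each of these instructions is fetched and decoded by the interpreter loop before the actual comparison or addition is even started.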
The interpreter steps through the resulting bytecode and interprets it. However, we can cut out the interpreter by cythonizing the very same code:
%load_ext Cython
%%cython
def cy_fun(i,N,step):
    res=0.0
    while i<N:
        res+=i
        i+=step
    return res
This cuts the running time by more than 50%:
In [13]: %timeit cy_fun(0.0,1.0e5,1.0)
100 loops, best of 3: 10.9 ms per loop
When we look into the produced c-code, we see that the right functions are called directly without the need of being interpreted/calling ceval, here after stripping down the boilerplate code:
static PyObject *__pyx_pf_4test_cy_fun(CYTHON_UNUSED PyObject *__pyx_self, PyObject *__pyx_v_i, PyObject *__pyx_v_N, PyObject *__pyx_v_step) {
    ...
    while (1) {
        __pyx_t_1 = PyObject_RichCompare(__pyx_v_i, __pyx_v_N, Py_LT);
        ...
        __pyx_t_2 = __Pyx_PyObject_IsTrue(__pyx_t_1);
        ...
        if (!__pyx_t_2) break;
        ...
        __pyx_t_1 = PyNumber_InPlaceAdd(__pyx_v_res, __pyx_v_i);
        ...
        __pyx_t_1 = PyNumber_InPlaceAdd(__pyx_v_i, __pyx_v_step);
    }
    ...
    return __pyx_r;
}
However, this Cython function handles Python objects and not C-style floats, so PyNumber_InPlaceAdd has to figure out what these objects really are (integer, float, something else?) and dispatch the call to the right function that does the job.
With the help of Cython we can also eliminate the need for this dispatch and call the addition for floats directly:
%%cython
def c_fun(double i,double N, double step):
    cdef double res=0.0
    while i<N:
        res+=i
        i+=step
    return res
In this version, i, N, step and res are c-style doubles and no longer python objects. So there is no longer need to call dispatch-functions like PyNumber_InPlaceAdd but we can directly call +-operator for double:
static PyObject *__pyx_pf_4test_c_fun(CYTHON_UNUSED PyObject *__pyx_self, double __pyx_v_i, double __pyx_v_N, double __pyx_v_step) {
    ...
    __pyx_v_res = 0.0;
    ...
    while (1) {
        __pyx_t_1 = ((__pyx_v_i < __pyx_v_N) != 0);
        if (!__pyx_t_1) break;
        __pyx_v_res = (__pyx_v_res + __pyx_v_i);
        __pyx_v_i = (__pyx_v_i + __pyx_v_step);
    }
    ...
    return __pyx_r;
}
And the result is:
In [15]: %timeit c_fun(0.0,1.0e5,1.0)
10000 loops, best of 3: 148 µs per loop
Now, this is a speed-up by a factor of about 70 (10.9 ms vs. 148 µs) compared to the version without the interpreter but with dispatch.
However, concluding that dispatch and allocation are the bottleneck here (because eliminating them brought the largest speed-up) would be a fallacy: the interpreter is responsible for more than 50% of the running time (about 15 ms), while dispatch and allocation account "only" for about 10 ms.
There are more performance problems than the interpreter and dynamic dispatch, though: a float is immutable, so every time its value changes a new object must be created and registered/unregistered in the garbage collector.
We can introduce mutable floats, which are changed in place and don't need registering/unregistering:
%%cython
cdef class MutableFloat:
    cdef double x
    def __cinit__(self, x):
        self.x=x
    def __iadd__(self, MutableFloat other):
        self.x=self.x+other.x
        return self
    def __lt__(MutableFloat self, MutableFloat other):
        return self.x<other.x
    def __gt__(MutableFloat self, MutableFloat other):
        return self.x>other.x
    def __repr__(self):
        return str(self.x)
The timings (I am now using a different machine, so the numbers are a little different):
def py_fun(i,N,step,acc):
    while i<N:
        acc+=i
        i+=step
    return acc
%timeit py_fun(1.0, 5e5,1.0,0.0)
30.2 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit cy_fun(1.0, 5e5,1.0,0.0)
16.9 ms ± 612 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit i,N,step,acc=MutableFloat(1.0),MutableFloat(5e5),MutableFloat(1.0),MutableFloat(0.0); py_fun(i,N,step,acc)
23 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit i,N,step,acc=MutableFloat(1.0),MutableFloat(5e5),MutableFloat(1.0),MutableFloat(0.0); cy_fun(i,N,step,acc)
11 ms ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Don't forget to reinitialize i, because it is mutable! The results:
         immutable  mutable
py_fun   30ms       23ms
cy_fun   17ms       11ms
So up to 7 ms (about 20%) is needed for registering/unregistering floats in the version with the interpreter (I'm not sure that nothing else plays a role), and more than 33% in the version without the interpreter.
As it looks now:
40% (13/30) of the time is used by interpreter
up to 33% of the time is used for the dynamic dispatch
up to 20% of the time is used for creating/deleting temporary objects
about 1% for the arithmetical operations
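The percentages in this list follow from the rounded timings in the table above. Here is a small sketch of the arithmetic, under the assumption (taken from the breakdown above) that whatever is left after subtracting the interpreter and temporary-object costs is attributed to dynamic dispatch:

```python
# Rounded timings in milliseconds, copied from the table above.
py_immutable, py_mutable = 30.0, 23.0
cy_immutable, cy_mutable = 17.0, 11.0

interpreter_ms = py_immutable - cy_immutable  # cost removed by cythonizing
temporaries_ms = py_immutable - py_mutable    # cost removed by mutable floats
# Assumption: the remainder is counted as dynamic dispatch (the arithmetic itself is tiny).
dispatch_ms = py_immutable - interpreter_ms - temporaries_ms

shares = {
    "interpreter": round(100 * interpreter_ms / py_immutable),
    "temporaries": round(100 * temporaries_ms / py_immutable),
    "dispatch": round(100 * dispatch_ms / py_immutable),
    # Share of temporaries in the version without the interpreter.
    "temporaries_cython": round(100 * (cy_immutable - cy_mutable) / cy_immutable),
}
print(shares)
```

The shares come out at 43%, 23% and 33% of the 30 ms, and 35% of the 17 ms, which is close to the rounded figures quoted in the text.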
Another problem is data locality, which becomes obvious for memory-bandwidth-bound problems. Modern caches work well when data is processed linearly, one consecutive memory address after another. This is true for looping over std::vector<> (or array.array), but not for looping over Python lists, because a list consists of pointers that can point anywhere in memory.
Consider the following Python scripts:
#list.py
N=int(1e7)
lst=[0]*int(N)
for i in range(N):
    lst[i]=i
print(sum(lst))
and
#bytearray.py
N=int(1e7)
b=bytearray(8*N)
m=memoryview(b).cast('L') #reinterpret as an array of unsigned longs
for i in range(N):
    m[i]=i
print(sum(m))
Both create 1e7 integers: the first version creates Python integers, the second plain C integers that are placed contiguously in memory.
The interesting part is, how many cache misses (D) these scripts produce:
valgrind --tool=cachegrind python list.py
...
D1 misses: 33,964,276 ( 27,473,138 rd + 6,491,138 wr)
versus
valgrind --tool=cachegrind python bytearray.py
...
D1 misses: 4,796,626 ( 2,140,357 rd + 2,656,269 wr)
That means about 7 times more cache misses for the Python integers. Part of this is because Python integers need more than 8 bytes of memory (probably 32 bytes, i.e. a factor of 4). Another part may be because they aren't aligned in memory the way the C integers of the bytearray are. I am not 100% sure about this: neighboring integers are created one after another, so chances are high that they are also stored next to each other somewhere in memory. Further investigation is needed.
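The size difference mentioned above can be checked from within Python. This is a minimal sketch using only the standard library. The exact size of an int object is a CPython implementation detail and may differ between versions and platforms, and the variable names are chosen here for illustration.

```python
import sys
from array import array

n = 1000
boxed = list(range(100_000, 100_000 + n))         # list of pointers to int objects
packed = array('q', range(100_000, 100_000 + n))  # contiguous signed 64-bit values

bytes_per_int_object = sys.getsizeof(boxed[0])
# Approximate size of one pointer slot inside the list itself.
bytes_per_list_slot = (sys.getsizeof(boxed) - sys.getsizeof([])) / n
bytes_per_packed_item = packed.itemsize

print(bytes_per_int_object, bytes_per_list_slot, bytes_per_packed_item)
```

On a typical 64-bit CPython build an int of this magnitude takes 28 bytes plus a pointer of about 8 bytes in the list, compared with 8 bytes per item in the array.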
