I am checking run times on factorials (I have to use a user-defined function), but I receive an odd error. The code I'm working with is as follows:
import numpy as np
import time
np.random.seed(14)
nums = list(np.random.randint(low=100, high=500, size=10))
# nums returns as [207, 444, 368, 427, 349, 458, 334, 256, 238, 308]
def fact(x):
    if x == 1:
        return 1
    else:
        return x * fact(x-1)

recursion_times = []
recursion_factorials = []

for i in nums:
    t1 = time.perf_counter()
    factorial = fact(i)
    t2 = time.perf_counter()
    execution = t2-t1
    recursion_factorials.append(factorial)
    recursion_times.append(execution)
    print(execution)
When I run the above, I get the following:
RuntimeWarning: overflow encountered in long_scalars
But when I run it as below, I get no warnings.
recursion_times = []
recursion_factorials = []
for i in [207, 444, 368, 427, 349, 458, 334, 256, 238, 308]:
    t1 = time.perf_counter()
    factorial = fact(i)
    t2 = time.perf_counter()
    execution = t2-t1
    recursion_factorials.append(factorial)
    recursion_times.append(execution)
    print(execution)
I know it's a bit of extra overhead to use the list nums, but why would it trigger a runtime warning? I've tried digging around, but I only find threads about dynamically-named variables and warning-suppression libraries; I'm looking for why this might happen.
For what it's worth, I'm running Python3 in a jupyter notebook. Glad to answer any other questions if it will help.
Thanks in advance for the help!
If (as in the current version of your post) you created nums by calling list on a NumPy array, but wrote an explicit list literal with no NumPy for the second test, then the second test gives no warning because it's not using NumPy. nums is a list of NumPy fixed-width integers, while the other list is a list of ordinary Python ints. Ordinary Python ints don't overflow.
(If you want to create a list of ordinary Python scalars from a NumPy array, the way to do that is with array.tolist(). This is usually undesirable due to performance implications, but it is occasionally necessary to interoperate with code that chokes on NumPy types.)
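For example, a quick way to see the difference (the exact type name depends on your platform, e.g. numpy.int64 on Linux/macOS and often numpy.int32 on Windows):

import numpy as np

np.random.seed(14)
nums = list(np.random.randint(low=100, high=500, size=10))
print(type(nums[0]))    # a fixed-width NumPy integer, e.g. <class 'numpy.int64'>

plain = np.random.randint(low=100, high=500, size=10).tolist()
print(type(plain[0]))   # <class 'int'> -- an arbitrary-precision Python int

Calling fact() on one of the NumPy integers keeps every intermediate product as a fixed-width integer, which wraps around (and triggers the overflow RuntimeWarning) once the factorial exceeds the integer width; the plain-int version just keeps growing.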
There would usually be an additional effect due to the default Python warning handling. By default, Python only emits a warning once per code location per Python process. In the original version of your question, it looked like this was causing the difference.
Using a variable or not using a variable has no effect on this warning.
I have used the code below for my Python Streamlit deployment of an ML model.
import streamlit as st
import pickle
import numpy as np
import pandas as pd
similarity=pickle.load(open(r'C:\Users\nikso\OneDrive\Desktop\mlproject\similarity.pkl','rb'),buffers=None)
list=pickle.load(open(r'C:\Users\nikso\OneDrive\Desktop\mlproject\movies_dict.pkl','rb'),buffers=None)
movies=pd.DataFrame.from_dict(list)
def recomm(movie):
    mov_index=movies[movies['title']==movie].index[0]
    sim=similarity[mov_index]
    movlist=sorted(list(enumerate(sim)),reverse=True,key=lambda x:x[1])[1:6]
    rec_movie=[]
    for i in movlist:
        # print(i[0])
        rec_movie.append(movies.iloc[i[0]]['title'])
    return rec_movie

st.title('Movie Recommender System')

selected_movie_name = st.selectbox(
    'How would you like to be contacted?',
    movies['title'].values)

if st.button('Recommend'):
    recom=recomm(selected_movie_name)
    # recom=np.array(recom)
    for i in recom:
        st.write(i)
On Colab the code works fine, but in VS Code it shows this error.
File "C:\Users\anaconda3\envs\Streamlit\lib\site-packages\streamlit\scriptrunner\script_runner.py", line 554, in _run_script
exec(code, module.__dict__)
File "C:\Users\OneDrive\Desktop\mlproject\app.py", line 30, in <module>
recom=recomm(selected_movie_name)
File "C:\Users\OneDrive\Desktop\mlproject\app.py", line 15, in recomm
movlist=sorted(list(enumerate(sim)),reverse=True,key=lambda x:x[1])[1:6]
Now I had to use different IDEs for deployment. But when I removed the keyword 'list' in the given line 15, it worked fine. What can be the reason behind it? I am a beginner and really curious about it. Thank you.
But when I removed the keyword 'list' in the given line 15 it worked fine. What can be the reason behind it?
TL;DR: sorted accepts iterables, and enumerate is already an iterable
Long answer:
When you define list as
list=pickle.load(open(r'C:\Users\nikso\OneDrive\Desktop\mlproject\movies_dict.pkl','rb'),buffers=None)
you're shadowing Python's built-in list type. Python lets you do this without issuing any warnings, but the result is that, in your script, list now refers to a dictionary object. So when you call list(enumerate(sim)) later on, you're trying to call that dictionary object, and a dictionary is not callable.
The solution? Avoid overriding Python built-ins whenever you can.
import streamlit as st
import pickle
import numpy as np
import pandas as pd
similarity=pickle.load(open(r'C:\Users\nikso\OneDrive\Desktop\mlproject\similarity.pkl','rb'),buffers=None)
movies_dict=pickle.load(open(r'C:\Users\nikso\OneDrive\Desktop\mlproject\movies_dict.pkl','rb'),buffers=None)
movies=pd.DataFrame.from_dict(movies_dict)
def recomm(movie):
    mov_index=movies[movies['title']==movie].index[0]
    sim=similarity[mov_index]
    movlist=sorted(list(enumerate(sim)),reverse=True,key=lambda x:x[1])[1:6]
    rec_movie=[]
    for i in movlist:
        # print(i[0])
        rec_movie.append(movies.iloc[i[0]]['title'])
    return rec_movie

st.title('Movie Recommender System')

selected_movie_name = st.selectbox(
    'How would you like to be contacted?',
    movies['title'].values)

if st.button('Recommend'):
    recom=recomm(selected_movie_name)
    # recom=np.array(recom)
    for i in recom:
        st.write(i)
To answer specifically why removing list on line 15 seemed to fix the issue, though: sorted accepts any iterable, and enumerate already returns one. All list is doing on line 15 is collecting the results of enumerate into a list before passing them to sorted. But the fundamental reason removing list fixed things is that you were shadowing Python's built-in, which you generally want to avoid.
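Here is a minimal sketch of the same failure mode, stripped of Streamlit and pickle (the dictionary is just a hypothetical stand-in for your unpickled movies_dict):

data = {"title": ["Movie A", "Movie B"]}    # stand-in for the unpickled dictionary
list = data                                 # shadows the built-in list from here on

pairs = list(enumerate([0.3, 0.9, 0.1]))    # TypeError: 'dict' object is not callable

In an interactive session, del list would make the built-in reachable again, but the cleaner fix is simply to choose a different variable name, as in the corrected script above.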
I know Numba does not support all Python features nor all NumPy features.
However I really need to speed up the execution time of the following function, which is block_reduce available in the scikit-image library (I've not downloaded the whole package, I've just taken block_reduce and view_as_blocks from it).
Here is the original code (I've just removed the examples from the docstring).
block_reduce.py
import numpy as np
from numpy.lib.stride_tricks import as_strided
def block_reduce(image, block_size, func=np.sum, cval=0):
    """
    Taken from scikit-image to avoid installation (it's very big)

    Down-sample image by applying function to local blocks.

    Parameters
    ----------
    image : ndarray
        N-dimensional input image.
    block_size : array_like
        Array containing down-sampling integer factor along each axis.
    func : callable
        Function object which is used to calculate the return value for each
        local block. This function must implement an ``axis`` parameter such
        as ``numpy.sum`` or ``numpy.min``.
    cval : float
        Constant padding value if image is not perfectly divisible by the
        block size.

    Returns
    -------
    image : ndarray
        Down-sampled image with same number of dimensions as input image.
    """
    if len(block_size) != image.ndim:
        raise ValueError("`block_size` must have the same length "
                         "as `image.shape`.")

    pad_width = []
    for i in range(len(block_size)):
        if block_size[i] < 1:
            raise ValueError("Down-sampling factors must be >= 1. Use "
                             "`skimage.transform.resize` to up-sample an "
                             "image.")
        if image.shape[i] % block_size[i] != 0:
            after_width = block_size[i] - (image.shape[i] % block_size[i])
        else:
            after_width = 0
        pad_width.append((0, after_width))

    image = np.pad(image, pad_width=pad_width, mode='constant',
                   constant_values=cval)

    blocked = view_as_blocks(image, block_size)

    return func(blocked, axis=tuple(range(image.ndim, blocked.ndim)))
def view_as_blocks(arr_in, block_shape):
    """Block view of the input n-dimensional array (using re-striding).

    Blocks are non-overlapping views of the input array.

    Parameters
    ----------
    arr_in : ndarray
        N-d input array.
    block_shape : tuple
        The shape of the block. Each dimension must divide evenly into the
        corresponding dimensions of `arr_in`.

    Returns
    -------
    arr_out : ndarray
        Block view of the input array.
    """
    if not isinstance(block_shape, tuple):
        raise TypeError('block needs to be a tuple')

    block_shape = np.array(block_shape)
    if (block_shape <= 0).any():
        raise ValueError("'block_shape' elements must be strictly positive")

    if block_shape.size != arr_in.ndim:
        raise ValueError("'block_shape' must have the same length "
                         "as 'arr_in.shape'")

    arr_shape = np.array(arr_in.shape)
    if (arr_shape % block_shape).sum() != 0:
        raise ValueError("'block_shape' is not compatible with 'arr_in'")

    # -- restride the array to build the block view
    new_shape = tuple(arr_shape // block_shape) + tuple(block_shape)
    new_strides = tuple(arr_in.strides * block_shape) + arr_in.strides

    arr_out = as_strided(arr_in, shape=new_shape, strides=new_strides)

    return arr_out
test_block_reduce.py
import numpy as np
import time
from block_reduce import block_reduce
image = np.arange(3*3*1000).reshape(3, 3, 1000)
# DO NOT REPORT THIS... COMPILATION TIME IS INCLUDED IN THE EXECUTION TIME!
start = time.time()
block_reduce(image, block_size=(3, 3, 1), func=np.mean)
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))
# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time.time()
block_reduce(image, block_size=(3, 3, 1), func=np.mean)
end = time.time()
print("Elapsed (after compilation) = %s" % (end - start))
I went through many issues with this code.
For example, Numba does not support function-typed parameters. But even if I try to work around this problem by passing a string instead (for example, func would be the string "sum" instead of np.sum), I run into many more issues related to features unsupported by Numba (like np.pad, isinstance, the tuple function, etc.).
Going through each single issue turned out to be very painful. For example, I've tried to incorporate all the code for np.pad from numpy into block_reduce.py and add the numba.jit decorator to np.pad but I got additional problems.
If there is a smart way to use Numba despite all these unsupported features I would be happy with it.
Otherwise, is there any alternative to Numba? I know there is PyPy, which I've never used. If PyPy is a solution to my problem, I should point out that I only need this single script, block_reduce.py, to run under PyPy; the rest of the project should keep running on CPython.
I was also thinking of writing a C extension module, which I've never done. But if it's worth trying, I will.
Have you tried running detailed profiling of your code? If you are dissatisfied with the performance of your program I think it can be very helpful to use a tool such as cProfile or py-spy. This can identify bottlenecks in your program and which parts specifically need to be sped up.
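For instance, here is one way to profile your test script with the standard-library cProfile, assuming block_reduce.py from your question is importable (the context-manager form needs Python 3.8+; on older versions you can call profiler.enable()/profiler.disable() instead):

import cProfile
import pstats

import numpy as np
from block_reduce import block_reduce

image = np.arange(3 * 3 * 1000).reshape(3, 3, 1000)

with cProfile.Profile() as profiler:
    block_reduce(image, block_size=(3, 3, 1), func=np.mean)

stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(10)    # show the 10 most expensive call sites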
That being said, as @CJR said, if your program is spending the bulk of its compute time inside NumPy, there is likely no reason to worry about speeding it up with a just-in-time compiler or similar modifications to your setup. As explained in more detail here, NumPy is fast because it implements compute-intensive tasks in compiled languages, so it abstracts that work away and saves you from worrying about it.
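As a rough illustration of that point (timings are machine-dependent; the only takeaway is the gap between a Python-level loop and the same reduction done inside NumPy's compiled code):

import time
import numpy as np

values = np.random.rand(10_000_000)

start = time.perf_counter()
total_py = sum(float(v) for v in values)    # pure-Python loop over every element
print("python loop:", time.perf_counter() - start)

start = time.perf_counter()
total_np = np.sum(values)                   # same reduction in compiled NumPy code
print("np.sum:     ", time.perf_counter() - start)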
Depending on what exactly you are planning to do, it is possible that your efficiency could be improved by parallelism, but this is not something I would worry about yet.
To end on a more general note: while optimizing code efficiency is of course very important, it is imperative to do so carefully and deliberately. As Donald Knuth famously said, "premature optimization is the root of all evil (or at least most of it) in programming." See this Stack Exchange thread for more discussion.
I am getting an error trying to enable numba optimization on a defined function.
Here is the function simplified:
@jit
def monte_carlo(iterations):
    key1 = []
    key2 = []
    scores = []
    for i in range(iterations):
        random.seed(i)
        temp_matrix = random.sample(matrix, length)
        for j in range(iterations):
            random.seed(j)
            key2.append(i)
            key1.append(j)
            for x in range(...):
                try: temp_matrix[x] = random.sample(matrix[x], len(matrix[x]))
                except: continue
            scores.append(...)
    return scores, key1, key2

monte_carlo(1000)
Then I receive the error below; I also had issues using CUDA instead of jit.
Traceback (most recent call last):
File "..."
File ...\numba\dispatcher.py", line 404, in _compile_for_args
error_rewrite(e, 'unsupported_error')
File "...\numba\dispatcher.py", line 344, in error_rewrite
reraise(type(e), e, None)
File "...\numba\six.py", line 668, in reraise
raise value.with_traceback(tb)
numba.errors.UnsupportedError: Failed in nopython mode pipeline (step: analyzing bytecode)
Use of unsupported opcode (CONTINUE_LOOP) found
File "...py", line 32:
def monte_carlo(iterations):
<source elided>
try: temp_qa_matrix[x] = random.sample(input.qa_matrix[x], len(input.qa_matrix[x]))
except: continue
^
So it does not seem to like the continue inside the loop, even though continue is listed as a supported construct:
Numba: Supported Python features
I think Numba's documentation is a little incomplete. It can handle ordinary continue statements, which use the Python opcode JUMP_ABSOLUTE under the hood, but not continue statements inside try/except blocks, which use the Python opcode CONTINUE_LOOP.
Here's an example of a simple function that (unnecessarily) uses continue and works with Numba. It cuts in half elements of an array that are greater than 0.5.
def halve(x):
    for i in range(len(x)):
        if x[i] <= 0.5:
            continue
        x[i] /= 2
If we import the Python dis module and look at the output of dis.dis(halve), we see that there are two JUMP_ABSOLUTE opcodes. This is what Python normally uses for continue statements. If we use Numba to jit this function and run it on an array, we'll see that it works with no problem.
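Concretely, on the Python versions where Numba raises this error (3.7 and earlier; the CONTINUE_LOOP opcode was removed from CPython in 3.8), you can look at the bytecode yourself:

import dis

dis.dis(halve)    # for the version above, the continue shows up as JUMP_ABSOLUTE;
                  # wrap it in try/except and one of those becomes CONTINUE_LOOP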
But if we rewrite halve to use try/except:
def halve(x):
    for i in range(len(x)):
        try:
            assert x[i] > 0.5
        except:
            continue
        x[i] /= 2
and look at dis.dis(halve), we see that one of the JUMP_ABSOLUTE opcodes has been replaced by CONTINUE_LOOP. I don't know the Python details under the hood, but sure enough, if we try to jit this function, then Numba complains that there's an unsupported opcode.
So, TLDR: looks like you can't use continue inside try/except with Numba, for obscure reasons related to the Python implementation.
I suspect there's almost always a workaround to this, but since your code isn't fully self-contained it's hard for me to know.
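For instance, one pattern that often works is to keep the try/except (and its continue) in plain Python and jit only the numeric inner loop, which then needs nothing beyond a plain continue. A rough sketch reusing the toy halve logic rather than your real matrix code (njit, the dtype, and the exception types here are my assumptions, not something taken from your script):

import numpy as np
from numba import njit

@njit
def halve_large(x):
    # numeric hot loop: only supported constructs, so the plain continue is fine
    for i in range(len(x)):
        if x[i] <= 0.5:
            continue
        x[i] /= 2

def process(rows):
    # exception-prone conversion stays in ordinary Python, outside the jitted code
    for row in rows:
        try:
            arr = np.asarray(row, dtype=np.float64)
        except (TypeError, ValueError):
            continue
        halve_large(arr)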
(Side note: usually Numba will do much better if you use NumPy arrays rather than lists.)
I have installed dill/pathos and their dependencies (with some difficulty) and I'm trying to run a function over several processes. The class/attribute Model(self.xml,self.exp_data,i).SSR is custom made and depends on loads of other custom functions, so I apologize in advance for not being able to provide 'runnable' code. In brief, it takes some experimental data, integrates a system of ODEs with Python's pysces module, and calculates the sum of squares (SSR). The purpose of parallelizing this code is to speed up this calculation with multiple parameter sets.
The code:
import multiprocess
def evaluate_chisq(pop):
    p = multiprocess.Pool(8)
    res = p.map(lambda i: Model(self.xml, self.exp_data, i).SSR, pop)  # calculate SSR with this parameter set
    return res
The error message I get is:
File "C:\Anaconda1\lib\site-packages\multiprocess\pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "C:\Anaconda1\lib\site-packages\multiprocess\pool.py", line 567, in get
raise self._value
AssertionError
Then I have tried using map_async :
def evaluate_chisq(pop):
    p = multiprocess.Pool(8)
    res = p.map_async(lambda i: Model(self.xml, self.exp_data, i).SSR, pop)  # calculate SSR with this parameter set
    return res
which returns a <multiprocess.pool.MapResult object at 0x0000000014AF8C18> object that gives me the same error when I attempt to use the MapResult's get method:
File "C:\Anaconda1\lib\site-packages\multiprocess\pool.py", line 567, in get
raise self._value
AssertionError
Does anybody know what I'm doing wrong?
On Windows you need to call freeze_support from __main__ (i.e. right after the if __name__ == '__main__': line of your main module).
See https://docs.python.org/2/library/multiprocessing.html#multiprocessing.freeze_support.
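A minimal, self-contained sketch of that pattern (ssr_for_params is just a stand-in for your Model(self.xml, self.exp_data, i).SSR call; multiprocess mirrors the multiprocessing API, so freeze_support is available under the same name):

import multiprocess

def ssr_for_params(params):
    # stand-in for Model(xml, exp_data, params).SSR
    return sum(x * x for x in params)

def evaluate_chisq(pop):
    p = multiprocess.Pool(8)
    res = p.map(ssr_for_params, pop)
    p.close()
    p.join()
    return res

if __name__ == '__main__':
    multiprocess.freeze_support()    # required on Windows before any Pool work
    print(evaluate_chisq([[1.0, 2.0], [3.0, 4.0]]))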
I have a python script that calculates the eigenvalues of matrices from a list, and I would like to insert these eigenvalues into another collection in the same order as the original matrix and I would like to do this by spawning up multiple processes.
Here is my code:
import time
import collections
import numpy as NP
from scipy import linalg as LA
from joblib import Parallel, delayed
def computeEigenV(unit_of_work):
    current_index = unit_of_work[0]
    current_matrix = unit_of_work[1]
    e_vals, e_vecs = LA.eig(current_matrix)
    finished_unit = (current_index, lowEV[::-1])
    return finished_unit

def run(work_list):
    pool = Parallel(n_jobs = -1, verbose = 1, pre_dispatch = 'all')
    results = pool(delayed(computeEigenV)(unit_of_work) for unit_of_work in work_list)
    return results
if __name__ == '__main__':
    # create original array of matrices
    original_matrix_list = []
    work_list = []

    # basic set up so we can run this test
    for i in range(0, 100):
        # generate the matrix & unit of work
        matrix = NP.random.random_integers(0, 100, (500, 500))
        # insert into respective resources
        original_matrix_list.append(matrix)

    for i, matrix in enumerate(original_matrix_list):
        unit_of_work = [i, matrix]
        work_list.append(unit_of_work)

    work_result = run(work_list)
so work_result should hold all the eigenvalues from each matrix after all processes finish. And the iterator I am using is unit_of_work which is a list containing the index of the matrix (from the original_matrix_list) and the matrix itself.
The weird thing is, if I were to run this code by doing python matrix.py everything works perfectly. But when I use auto (a program that does calculations for differential equations?) to run my script, typing auto matrix.py gives me the following error:
Traceback (most recent call last):
File "matrix.py", line 50, in <module>
work_result = run(work_list)
File "matrix.py", line 27, in run
results = pool(delayed(computeEigenV)(unit_of_work) for unit_of_work in work_list)
File "/Library/Python/2.7/site-packages/joblib/parallel.py", line 805, in __call__
while self.dispatch_one_batch(iterator):
File "/Library/Python/2.7/site-packages/joblib/parallel.py", line 658, in dispatch_one_batch
tasks = BatchedCalls(itertools.islice(iterator, batch_size))
File "/Library/Python/2.7/site-packages/joblib/parallel.py", line 69, in __init__
self.items = list(iterator_slice)
File "matrix.py", line 27, in <genexpr>
results = pool(delayed(computeEigenV)(unit_of_work) for unit_of_work in work_list)
File "/Library/Python/2.7/site-packages/joblib/parallel.py", line 162, in delayed
pickle.dumps(function)
TypeError: expected string or Unicode object, NoneType found
Note: when I ran this with auto I had to change if __name__ == '__main__': to if __name__ == '__builtin__':
I looked up this error and it seems like I am not serializing the iterator unit_of_work correctly when passing it around to different processes. I have then tried to use serialized_unit_of_work = pickle.dumps(unit_of_work), pass that around, and do pickle.loads when I need to use the iterator, but I still get the same error.
Can someone please help point me in the right direction as to how I can fix this? I hesitate to use pickle.dump(obj, file[, protocol]) because eventually I will be running this to calculate eigenvalues of thousands of matrices and I don't really want to create that many files to store the serialized iterator if possible.
Thanks!! :)
You can't pickle an iterator in python2.7 (but you can from 3.4 onward).
Also, pickling in __main__ works differently than pickling outside of __main__, and it would seem that auto is doing something odd with __main__. What you will often observe when pickling fails on a particular object is that if, instead of running the script containing the object directly, you run a script as main which imports the portion of the script with the "difficult-to-serialize" object, then pickling succeeds. This is because the object is pickled by reference at a namespace level above where the "difficult" object lives… thus it's never directly pickled.
So, you can probably get away with pickling what you want, by adding a reference layer… a file import or a class. But, if you want to pickle an iterator, you are out of luck unless you move to at least python3.4.
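To make the reference-layer idea concrete, here is a sketch of one way to restructure your script (the file name eigen_worker.py and the decision to return e_vals directly are my choices, not something from your code):

# eigen_worker.py -- the worker lives in its own module, so it pickles by reference
from scipy import linalg as LA

def computeEigenV(unit_of_work):
    current_index, current_matrix = unit_of_work
    e_vals, e_vecs = LA.eig(current_matrix)
    return current_index, e_vals

# matrix.py -- the script that auto runs only refers to the imported name
from joblib import Parallel, delayed
from eigen_worker import computeEigenV

def run(work_list):
    pool = Parallel(n_jobs=-1, verbose=1, pre_dispatch='all')
    return pool(delayed(computeEigenV)(w) for w in work_list)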