I am using a np.vectorize-ed function and would like to see the progress of the function with tqdm. However, I have not been able to figure out how to do this.
All the suggestions I have found relate to converting the calculation into a for-loop, or into a pd.DataFrame.
I finally found a method that gets the tqdm progress bar to update with an np.vectorize-d function: I wrap the vectorized call in a with block:
with tqdm(total=len(my_inputs)) as pbar:
    my_output = np.vectorize(my_function)(my_inputs)
In my_function() I then add the following lines:
global pbar
pbar.update(1)
and voila! I now have a progress bar that updates with each iteration, with only a slight performance dip in my code.
Note: when you instantiate the vectorized function it might complain that pbar is not yet defined. Simply put pbar = 0 before that point, and the function will then use the pbar defined by the with block.
Hope it helps everyone reading here.
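For reference, here is a minimal runnable sketch of how those pieces fit together (my_inputs and the squaring step are placeholders, not from the original post):

import numpy as np
from tqdm import tqdm

my_inputs = np.arange(1000)

def my_function(x):
    global pbar        # use the bar created by the with block below
    pbar.update(1)     # one tick per element processed
    return x ** 2      # placeholder for the real per-element work

pbar = 0  # placeholder so the name exists before the with block
with tqdm(total=len(my_inputs)) as pbar:
    # note: np.vectorize may call the function one extra time to infer the
    # output type, so the count can be off by one
    my_output = np.vectorize(my_function)(my_inputs)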
To the best of my knowledge, tqdm does not wrap over numpy.vectorize.
To display the progress bar for numpy arrays, numpy.ndenumerate can be used.
Given the inputs and function:
import numpy as np
from tqdm import tqdm
a = np.array([1, 2, 3, 4])
b = 2
def myfunc(a, b):
    "Return a-b if a>b, otherwise return a+b"
    if a > b:
        return a - b
    else:
        return a + b
Replace this vectorised part below
# using numpy.vectorize
vfunc = np.vectorize(myfunc)
vfunc(a, b)
with this
# using numpy.ndenumerate instead
[myfunc(x, b) for index, x in tqdm(np.ndenumerate(a), total=a.size)]
To see the tqdm progress.
Based on Carl Kirstein's answer I came up with the following solution. I added the pbar element to my_function as an argument and updated it inside the function.
with tqdm(total=len(my_inputs)) as pbar:
    my_output = np.vectorize(my_function)(my_inputs, pbar)
Inside my_function I added pbar.update(1):
def my_function(args, pbar):
    ...
    pbar.update(1)
    ...
I found this answer (https://stackoverflow.com/a/59905309/7462275), which displays a progress bar, very simple to use. I would like to use this simple solution for functions that take many arguments.
Following the above-mentioned answer, I wrote this code, which works:
from tqdm.contrib.concurrent import process_map
import multiprocessing as mp
import time

def _foo(my_tuple):
    my_number1, my_number2 = my_tuple
    square = my_number1 * my_number2
    time.sleep(1)
    return square

r = process_map(_foo, [(i, j) for i, j in zip(range(0, 30), range(100, 130))], max_workers=mp.cpu_count())
But I wonder whether this (packing the function's variables into a tuple) is the correct solution. Thanks for any answer.
Say I have a function that gives different results for the same input and needs to be run multiple times for the same input to obtain a mean (I'll sketch a trivial example, but in reality the source of randomness is train_test_split from sklearn.model_selection, if that matters).
def f(a, b):
    output = []
    for i in range(b):
        output.append(np.mean(np.random.rand(a,)))
    return np.mean(output)
The arguments for this function are defined inside another function like so (again, a trivial example; please don't mind if these are not efficient/pythonic):
def g(c, d):
    a = c
    b = c * d
    result = f(a, b)
    return result
Instead of using a for loop, I want to use multiprocessing to speed up the execution time. I found that neither pool.apply nor pool.starmap does the trick (execution time goes up); only pool.map works. However, it can only take one argument (in this case, the number of iterations). I tried redefining f as follows:
def f(number_of_iterations):
    output = np.mean(np.random.rand(a,))
    return output
And then use pool.map as follows:
import multiprocessing as mp

def g(c, d):
    temp = []
    a = c
    b = c * d
    pool = mp.Pool(mp.cpu_count())
    temp = pool.map(f, [number_of_iterations for number_of_iterations in range(b)])
    pool.close()
    result = np.mean(temp)
    return result
Basically, a convoluted workaround to make f a one-argument function. The hope was that f would still pick up the argument a; however, executing g results in an error about a not being defined.
Is there any way to make pool.map work in this context?
I think functools.partial solves your issue. Here is an implementation: https://stackoverflow.com/a/25553970/9177173 And here is the documentation: https://docs.python.org/3.7/library/functools.html#functools.partial
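A minimal sketch of how that could look for the setup in the question (f, g, a, b, c, d follow the question's names; everything else is illustrative, and whether this beats the plain loop depends on how expensive f really is):

import multiprocessing as mp
import numpy as np
from functools import partial

def f(a, number_of_iterations):
    # one noisy evaluation; a is the fixed argument, the second is just the iteration index
    return np.mean(np.random.rand(a,))

def g(c, d):
    a = c
    b = c * d
    with mp.Pool(mp.cpu_count()) as pool:
        # partial(f, a) fixes a, so pool.map only needs to supply one argument per call
        temp = pool.map(partial(f, a), range(b))
    return np.mean(temp)

if __name__ == '__main__':  # needed on platforms that spawn worker processes
    print(g(1000, 5))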
I know this is super basic, and I have been searching everywhere, but I am still very confused by everything I'm seeing and am not sure of the best way to do this; I'm having a hard time wrapping my head around it.
I have a script where I have multiple functions. I would like the first function to pass its output to the second, then the second to pass its output to the third, etc. Each does its own step in an overall process applied to the starting dataset.
For example, very simplified with bad names but this is to just get the basic structure:
#!/usr/bin/python
# script called process.py
import sys
infile = sys.argv[1]
def function_one():
    # do things
    return function_one_output

def function_two():
    # take output from function_one, and do more things
    return function_two_output

def function_three():
    # take output from function_two, do more things
    return/print function_three_output
I want this to run as one script and print the output / write it to a new file or whatever, which I know how to do. I am just unclear on how to pass the intermediate output of each function to the next.
infile -> function_one -> (intermediate1) -> function_two -> (intermediate2) -> function_three -> final result/outfile
I know I need to use return, but I am unsure how to call this at the end to get my final output
Individually?
function_one(infile)
function_two()
function_three()
or within each other?
function_three(function_two(function_one(infile)))
or within the actual function?
def function_one():
    # do things
    return function_one_output

def function_two():
    input_for_this_function = function_one()
    # etc etc etc
Thank you friends, I am overcomplicating this and need a very simple way to understand it.
You could define a data streaming helper function
from functools import reduce

def flow(seed, *funcs):
    return reduce(lambda arg, func: func(arg), funcs, seed)

flow(infile, function_one, function_two, function_three)

# for example
flow('HELLO', str.lower, str.capitalize, str.swapcase)
# returns 'hELLO'
Edit:
I would now suggest that a more "pythonic" way to implement the flow function above is:
def flow(seed, *funcs):
    for func in funcs:
        seed = func(seed)
    return seed
As ZdaR mentioned, you can run each function and store the result in a variable then pass it to the next function.
def function_one(file):
    # do things on file
    return function_one_output

def function_two(myData):
    # do things on myData
    return function_two_output

def function_three(moreData):
    # do more things on moreData
    return/print function_three_output

def Main():
    firstData = function_one(infile)
    secondData = function_two(firstData)
    function_three(secondData)
This is assuming your function_three would write to a file or doesn't need to return anything. Another method, if these three functions will always run together, is to call them inside function_three. For example...
def function_three(file):
    firstStep = function_one(file)
    secondStep = function_two(firstStep)
    # do things on secondStep
    return/print to file
Then all you have to do is call function_three in your main and pass it the file.
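For example, a minimal usage sketch (Main and infile follow the snippets above):

def Main():
    function_three(infile)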
For safety, readability and debugging ease, I would temporarily store the results of each function.
def function_one():
    # do things
    return function_one_output

def function_two(function_one_output):
    # take function_one_output and do more things
    return function_two_output

def function_three(function_two_output):
    # take function_two_output and do more things
    return/print function_three_output

result_one = function_one()
result_two = function_two(result_one)
result_three = function_three(result_two)
The added benefit here is that you can then check that each function is correct. If the end result isn't what you expected, just print the results you're getting or perform some other check to verify them. (Also, if you're running in the interpreter, they will stay in the namespace after the script ends, so you can test them interactively.)
result_one = function_one()
print(result_one)
result_two = function_two(result_one)
print(result_two)
result_three = function_three(result_two)
print(result_three)
Note: I used multiple result variables, but as PM 2Ring notes in a comment you could just reuse the name result over and over. That'd be particularly helpful if the results are large objects.
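For instance, the name-reuse variant looks like this (same placeholder functions as above):

result = function_one()
result = function_two(result)
result = function_three(result)
print(result)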
It's always better (for readability, testability and maintainability) to keep your functions as decoupled as possible, and to write them so that the output depends only on the input whenever possible.
So in your case, the best way is to write each function independently, ie:
def function_one(arg):
    do_something()
    return function_one_result

def function_two(arg):
    do_something_else()
    return function_two_result

def function_three(arg):
    do_yet_something_else()
    return function_three_result
Once you're there, you can of course directly chain the calls:
result = function_three(function_two(function_one(arg)))
but you can also use intermediate variables and try/except blocks if needed for logging / debugging / error handling etc:
r1 = function_one(arg)
logger.debug("function_one returned %s", r1)
try:
    r2 = function_two(r1)
except SomePossibleException as e:
    logger.exception("function_two raised %s for %s", e, r1)
    # either return, re-raise, ask the user what to do, etc.
    return 42  # when in doubt, always return 42!
else:
    r3 = function_three(r2)
    print("Yay! result is %s" % r3)
As an extra bonus, you can now reuse these three functions anywhere, each on its own and in any order.
NB: of course there ARE cases where it just makes sense to call a function from another function... Like, if you end up writing:
result = function_three(function_two(function_one(arg)))
everywhere in your code AND it's not an accidental repetition, it might be time to wrap the whole in a single function:
def call_them_all(arg):
    return function_three(function_two(function_one(arg)))
Note that in this case it might be better to decompose the calls, as you'll find out when you have to debug it...
I'd do it this way:
def function_one(x):
    # do things
    output = x ** 1
    return output

def function_two(x):
    output = x ** 2
    return output

def function_three(x):
    output = x ** 3
    return output
Note that I have modified the functions to accept a single argument, x, and added a basic operation to each.
This has the advantage that each function is independent of the others (loosely coupled), which allows them to be reused in other ways. In the example above, function_two() returns the square of its argument, and function_three() the cube of its argument. Each can be called independently from elsewhere in your code, without being entangled in some hardcoded call chain such as you would have if you called one function from another.
You can still call them like this:
>>> x = function_one(3)
>>> x
3
>>> x = function_two(x)
>>> x
9
>>> x = function_three(x)
>>> x
729
which lends itself to error checking, as others have pointed out.
Or like this:
>>> function_three(function_two(function_one(2)))
64
if you are sure that it's safe to do so.
And if you ever wanted to calculate the square or cube of a number, you can call function_two() or function_three() directly (but, of course, you would name the functions appropriately).
With d6tflow you can easily chain together complex data flows and execute them. You can quickly load input and output data for each task. It makes your workflow very clear and intuitive.
import d6tflow

class Function_one(d6tflow.tasks.TaskCache):
    def run(self):
        function_one_output = do_things()
        self.save(function_one_output)  # instead of return

@d6tflow.requires(Function_one)
class Function_two(d6tflow.tasks.TaskCache):
    def run(self):
        output_from_function_one = self.inputLoad()  # load function input
        function_two_output = do_more_things()
        self.save(function_two_output)

@d6tflow.requires(Function_two)
class Function_three(d6tflow.tasks.TaskCache):
    def run(self):
        output_from_function_two = self.inputLoad()
        function_three_output = do_more_things()
        self.save(function_three_output)

d6tflow.run(Function_three())  # executes all functions

function_one_output = Function_one().outputLoad()  # get function output
function_three_output = Function_three().outputLoad()
It has many more useful features like parameter management, persistence, intelligent workflow management. See https://d6tflow.readthedocs.io/en/latest/
This way, function_three(function_two(function_one(infile))) would be best: you do not need global variables, and each function is completely independent of the others.
Edited to add:
I would also say that function_three should not print anything; if you want to print the returned result, use:
print(function_three(function_two(function_one(infile))))
or something like:
output = function_three(function_two(function_one(infile)))
print(output)
Use parameters to pass the values:
def function1():
    foo = do_stuff()
    return function2(foo)

def function2(foo):
    bar = do_more_stuff(foo)
    return function3(bar)

def function3(bar):
    baz = do_even_more_stuff(bar)
    return baz

def main():
    thing = function1()
    print(thing)
I want to use support_code to define functions that interact with n-dimensional numpy arrays. Inside the code argument, the FOO3(i, j, k) notation works, but only there, not in support_code. Something like this:
import scipy
import scipy.weave
code = '''return_val = f(1);'''
support_code = '''int f(int i) {
    return FOO3(i, i, i);
}'''
foo = scipy.arange(3**3).reshape(3, 3, 3)
print(scipy.weave.inline(code, ['foo'], support_code=support_code))
The concept of support code is mainly to do some includes. In your case, I guess the function should look something like this:
import scipy
import scipy.weave
def foofunc(i):
    foo = scipy.arange(3**3).reshape(3, 3, 3)
    code = '''//do something lengthy with foo and maybe i'''
    scipy.weave.inline(code, ['foo', 'i'])
    return foo[i, i, i]
You don't need support code at all for what you're trying to do. You also don't get any speed improvement from doing the function return in C instead of in Python; the array access is negligible compared to the cost of the function call. To get a better idea of when and how weave can help you speed up your code, have a look here.
I have some expensive function f(x) that I want to calculate only once, but which is called rather frequently. In essence, the first time the function is called it should compute a whole bunch of values for a range of x (since it will be integrated over anyway), interpolate them with splines, and cache the coefficients somehow, possibly in a file, for further use.
My idea was to do something like the following, since it would be pretty easy to implement: the first time the function is called, it does something, then redefines itself, and from then on it does something else. However, it does not work as expected and might in general be bad practice.
def f():
    def g():
        print(2)
    print(1)
    f = g

f()
f()
Expected output:
1
2
Actual output:
1
1
Defining g() outside of f() does not help. Why does this not work? Other than that, the only solution I can think of right now is to use some global variable. Or does it make sense to somehow write a class for this?
This is overly complicated. Instead, use memoization:
def memoized(f):
    res = []
    def resf():
        if len(res) == 0:
            res.append(f())
        return res[0]
    return resf
and then simply
@memoized
def f():
    # expensive calculation here ...
    return calculated_value
In Python 3, you can replace memoized with functools.lru_cache.
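For instance, a minimal sketch using the standard-library cache (the body is the same placeholder as above):

from functools import lru_cache

@lru_cache(maxsize=None)  # with no arguments, f is evaluated once and the result is reused
def f():
    # expensive calculation here ...
    return calculated_value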
Simply add global f at the beginning of the f function; otherwise Python creates a local f variable.
Rebinding f inside f's own scope doesn't affect the name outside the function; if you want to change f, you can use global:
>>> def f():
... print(1)
... global f
... f=lambda: print(2)
...
>>> f()
1
>>> f()
2
>>> f()
2
You can use memoization and decoration to cache the result. See an example here. A separate question on memoization that might prove useful can be found here.
What you're describing is the kind of problem caching was invented for. Why not just have a buffer to hold the result: before doing the expensive calculation, check whether the buffer is already filled; if so, return the buffered result; otherwise, do the calculation, fill the buffer, and then return the result. No need to go all fancy with self-modifying code for this.
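A minimal sketch of that buffer idea (the names _cache and expensive_calculation are placeholders, not from the question):

_cache = None

def f():
    global _cache
    if _cache is None:                # first call: do the expensive work once
        _cache = expensive_calculation()
    return _cache                     # later calls just return the buffered result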