Parallelize a function of multiple arguments over just one of the arguments - Python

I have a function that processes a relatively large dataframe, and it takes quite a while to run. I was looking at ways of improving the run time and came across multiprocessing.Pool. If I understood correctly, it should run the function on equal chunks of the dataframe in parallel, which means it could potentially run quicker and save time.
My function takes 4 different arguments; the last three are mainly lookups, while the first is the dataframe of interest. It looks something like this:
def functionExample(dataOfInterest, lookup1, lookup2, lookup3):
    # do stuff with the data and lookups
    return output1, output2
Based on what I've read, I came up with the below, which I thought should work:
import numpy as np
import pandas as pd
from multiprocessing import Pool

num_partitions = 4
num_cores = 4

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
Then to call the process (which is mainly where I couldn't figure it out), I tried the below:
output1, output2 = parallelize_dataframe(dataOfInterest, functionExample)
This returns the error:
functionExample() missing 3 required positional arguments: 'lookup1', 'lookup2', and 'lookup3'
Then I tried adding the three arguments by doing the below:
output1, output2 = parallelize_dataframe(dataOfInterest, functionExample(lookup1, lookup2, lookup3))
This returns the error below, suggesting that the three values were taken as the first three arguments of the function, with the fourth now missing, rather than as the last three arguments the previous error said were missing:
functionExample() missing 1 required positional argument: 'lookup3'
And then, if I try feeding it all four arguments by doing the below:
output1, output2 = parallelize_dataframe(dataOfInterest, functionExample(dataOfInterest, lookup1, lookup2, lookup3))
It returns the error below:
'tuple' object is not callable
I'm not quite sure which of the above is the way to do it, if any at all. Should it be taking all of the function's arguments, including the desired dataframe? If so, why is it complaining about tuples?
Any help would be appreciated!
Thanks.

You can perform a partial binding of some arguments to create a new callable via functools.partial:
from functools import partial

output1, output2 = parallelize_dataframe(
    dataOfInterest,
    partial(functionExample, lookup1=lookup1, lookup2=lookup2, lookup3=lookup3))
Note that in the multiprocessing world, partial can be slow, so you may want to find a way to avoid the need to pass the arguments if they're large/expensive to pickle, assuming that's possible in your use case.
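If the lookups really are expensive to pickle, one possible workaround (a sketch of my own, not part of the answer above; the names _lookups, _init and worker are hypothetical) is to ship them to each worker process once via a Pool initializer and read them back from a module-level global in a small wrapper:

from multiprocessing import Pool

_lookups = None

def _init(lookup1, lookup2, lookup3):
    # runs once in each worker process at startup
    global _lookups
    _lookups = (lookup1, lookup2, lookup3)

def worker(chunk):
    # module-level, so the pool can pickle it
    lookup1, lookup2, lookup3 = _lookups
    return functionExample(chunk, lookup1, lookup2, lookup3)

pool = Pool(num_cores, initializer=_init, initargs=(lookup1, lookup2, lookup3))
results = pool.map(worker, df_split)

This way each lookup is pickled once per worker instead of once per chunk.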

In each case, you are trying to call the function, rather than pass the arguments for when the function is called. What you need is a new callable that calls your original with the correct arguments.
from functools import partial
output1, output2 = parallelize_dataframe(
    dataOfInterest,
    partial(functionExample, lookup1=x, lookup2=y, lookup3=z)
)

You could simply modify your function definition to take predefined default arguments, or make a function that calls your original function with those parameters.
def functionExample(dataOfInterest, lookup1=x, lookup2=y, lookup3=z):
    # do stuff with the data and lookups
    return output1, output2
or
def f(dataOfInterest):
    return functionExample(dataOfInterest, lookup1=x, lookup2=y, lookup3=z)
In this way, map() would work as you expect.
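One caveat worth adding (mine, not the answer's): the callable handed to pool.map must be picklable, so f has to live at module level; lambdas and functions nested inside other functions won't survive multiprocessing's default pickler. Assuming f is defined at the top of the module, the call is simply:

# f defined at module level, not nested inside another function
output = parallelize_dataframe(dataOfInterest, f)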

Related

How to set a ** parameter in Python

I'm a newbie in Python.
I'm using Python 3.7.7 and Tensorflow 2.1.0.
This is my code:
import tensorflow as tf
import tensorflow_datasets as tfds
d = {"name": "omniglot:3.0.0", "data_dir": "d:\\tmp"}
omniglot_builder = tfds.builder("omniglot:3.0.0", builder_init_kwargs=d)
omniglot_builder.download_and_prepare(download_dir="d:\\tmp")
But I get this error:
got an unexpected keyword argument 'builder_init_kwargs'
I want to set data_dir, but I don't know how to do it. I have tried to set download_dir in omniglot_builder.download_and_prepare(download_dir="d:\\tmp"), but it still downloads to ~/tensorflow_datasets.
From the Tensorflow documentation for tfds.builder:
**builder_init_kwargs: dict of keyword arguments passed to the DatasetBuilder. These will override keyword arguments passed in name, if any.
How can I set builder_init_kwargs parameter value?
Based on the docs, which say the tfds.builder method has the signature:
tfds.builder(
    name, **builder_init_kwargs
)
You want to do this:
dict = {"name":"omniglot:3.0.0", "data_dir": "d:\\tmp"}
tfds.builder(**dict)
The ** syntax unpacks the dictionary into keyword arguments, making the above code equivalent to:
tfds.builder(name="omniglot:3.0.0", data_dir="d:\\tmp")
To pass a kwargs argument in Python, you simply add ** before the dictionary itself.
So, this would be your code:
import tensorflow as tf
import tensorflow_datasets as tfds
dict = {"name": "omniglot:3.0.0", "data_dir": "d:\\tmp"}
omniglot_builder = tfds.builder("omniglot:3.0.0", builder_init_kwargs=**dict)
omniglot_builder.download_and_prepare(download_dir="d:\\tmp")
Of course, I am just guessing, because I know what a kwargs argument is, but I am not familiar with tensorflow.
Hope this helps!
It seems you need a little help with argument packing and unpacking.
In the definition of a function or method, you specify the sequence of arguments that will be passed. If you want to accept a variable number of input arguments, the mechanism is to "pack" them together into a tuple or dictionary. For example, say you want to get the sum of all arguments given:
def get_sum(a, b):  # only useful for two numbers
    return a + b

def get_sum(a, b, c):  # only useful for three numbers
    return a + b + c
You would have to write a different definition for every possible number of input arguments. The solution is to use the packing operator to pack all given arguments into a tuple that can be iterated over:
def get_sum(*list_of_inputs):  # * packs all positional arguments into a tuple
    x = 0
    for item in list_of_inputs:
        x += item
    return x

get_sum(1, 2, 3, 4, 5, 6, 7)  # returns 28
get_sum()  # returns 0
The same can be done for keyword arguments which get packed into a dictionary:
def foo(**keyword_args):
    for k in keyword_args:
        print(f'{k}: {keyword_args[k]}')
Now when you are using (calling) a function, sometimes you need to be able to "unpack" a list or a dictionary into the function call. The same operator is used to pack and unpack, so it looks very similar:
def foo(a, b, c):
    print(f'{a} + {b} = {c}')

arguments = ['spam', 'eggs', 'delicious']
foo(*arguments)  # unpack the list of arguments into their required positions
Now finally on to your specific case: the function you are trying to use defines **kwargs in its definition. This means it will take any keyword arguments and pack them all up into a dictionary for use inside the function body. The practical meaning of this is that you can provide keyword arguments that aren't specifically named in the function signature (this is particularly common when the function calls another function and passes the arguments along). If you have already packed up your arguments before calling the function, it is easy to unpack them using the same operator, as shown by Oli: tfds.builder(**builder_kwargs)
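As a minimal illustration of that pass-along pattern (the functions below are hypothetical, not part of tfds):

def configure(**options):
    # low-level function that accepts arbitrary keyword arguments
    print(options)

def configure_with_defaults(**options):
    # fills in a default, then forwards everything else unchanged
    options.setdefault("verbose", True)
    configure(**options)

configure_with_defaults(name="omniglot:3.0.0", data_dir="d:\\tmp")
# {'name': 'omniglot:3.0.0', 'data_dir': 'd:\\tmp', 'verbose': True}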

Caching in python using *args and lambda functions

I recently attempted Google's foo.bar challenge. After my time was up, I decided to try to find a solution to the problem I couldn't do, and found one here (includes the problem statement, if you're interested). I'd previously been making a dictionary for every function I wanted to cache, but it looks like in this solution any function/input can be cached using the same syntax.
Firstly, I'm confused about how the code even works: the *args variable isn't passed as an argument (and prints as empty). Here's a modified minimal example to illustrate my confusion:
mem = {}

def memoize(key, func, *args):
    """
    Helper to memoize the output of a function
    """
    print(args)
    if key not in mem:
        # store the output of the function in memory
        mem[key] = func(*args)
    return mem[key]
def example(n):
    return memoize(
        n,
        lambda: longrun(n),
    )

def example2(n):
    return memoize(
        n,
        longrun(n),
    )
def longrun(n):
    for i in range(10000):
        for j in range(100000):
            2**10
    return n
Here I use the same memoize function, but with a print added. The function example returns memoize(n, <a lambda function>). The function longrun is just an identity function with lots of useless computation, so it's easy to see whether the cache is working (example(2) takes ~5 seconds the first time and is almost instant after).
Here are my confusions:
Why is the third argument of memoize empty? When args is printed in memoize it prints (). Yet somehow mem[key] stores func(*args) as func(key)?
Why does this behavior only work when using the lambda function (example will cache but example2 won't)? I thought lambda: longrun(n) was just a short way of passing in a function which returns longrun(n).
As a bonus, does anyone know how you could memoize functions using a decorator?
Also I couldn't think of a more descriptive title, edits welcome. Thanks.
The notation *args stands for a variable number of positional arguments. For example, print can be used as print(1), print(1, 2), print(1, 2, 3) and so on. Similarly, **kwargs stands for a variable number of keyword arguments.
Note that the names args and kwargs are just a convention - it's the * and ** symbols that make them variadic.
Anyways, memoize uses this to accept basically any input to func. If the result of func isn't cached, it's called with the arguments. In a function call, *args is basically the reverse of *args in a function definition. For example, the following are equivalent:
# provide *args explicitly
print(1, 2, 3)
# unpack iterable to *args
arguments = 1, 2, 3
print(*arguments)
If args is empty, then calling print(*args) is the same as calling print() - no arguments are passed to it.
Functions and lambda functions are the same in Python; lambda is simply a different notation for creating a function object.
The problem is that in example2, you are not passing a function. You call a function, then pass on its result. Instead, you have to pass on the function and its argument separately.
def example2(n):
    return memoize(
        n,
        longrun,  # no () means no call, just the function object
        # all following parameters are put into *args
        n
    )
Now, some implementation details: why is args empty and why is there a separate key?
The empty args comes from your definition of the lambda. Let's write that as a function for clarity:
def example3(n):
    def nonlambda():
        return longrun(n)
    return memoize(n, nonlambda)
Note how nonlambda takes no arguments. The parameter n is bound from the containing scope as a closure. As such, you don't have to pass it to memoize - it is already bound inside nonlambda. Thus, args is empty in memoize, even though longrun does receive a parameter, because the two don't interact directly.
Now, why is it mem[key] = f(*args), not mem[key] = f(key)? That's actually slightly the wrong question; the right question is "why isn't it mem[f, args] = f(*args)?".
Memoization works because the same input to the same function leads to the same output. That is, f, args identifies your output. Ideally, your key would be f, args as that's the only relevant information.
The problem is you need a way to look up f and args inside mem. If you ever tried putting a list inside a dict, you know there are some types which don't work in mappings (or any other suitable lookup structure, for that matter). So if you define key = f, args, you cannot memoize functions taking mutable/unhashable types. Python's functools.lru_cache actually has this limitation.
Defining an explicit key is one way of solving this problem. It has the advantage that the caller can select an appropriate key, for example taking n without any modifications. This offers the best optimization potential. However, it breaks easily - using just n misses out the actual function called. Memoizing a second function with the same input would break your cache.
There are alternative approaches, each with pros and cons. Common is the explicit conversion of types: list to tuple, set to frozenset, and so on. This is slow, but the most precise. Another approach is to just call str or repr as in key = repr((f, args, sorted(kwargs.items()))), but it relies on every value having a proper repr.
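As for the bonus question about decorators, here is a minimal sketch of a memoizing decorator built on the repr-based key just described (it inherits the same caveat: every value needs a proper repr):

from functools import wraps

def memoized(func):
    cache = {}
    @wraps(func)
    def wrapper(*args, **kwargs):
        # repr-based key, with the trade-offs discussed above
        key = repr((args, sorted(kwargs.items())))
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]
    return wrapper

@memoized
def longrun(n):
    for i in range(10000):
        for j in range(100000):
            2**10
    return n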

How to convert to multithreading subprocess

I have a method in my Python (2.7.6) code that I am looking to run via multiprocessing, following the advice given in another SO question.
This is how the code is currently:
return self.capi(roi_rgb, "", False)
This is how I converted it:
pool = multiprocessing.Pool(None)
result = ""
r = pool.map_async(self.capi(roi_rgb, "", False), callback=result)
r.wait()
return result
but I'm getting errors with the above on the call to pool.map_async
TypeError: map_async() takes at least 3 arguments (3 given)
According to https://docs.python.org/2/library/multiprocessing.html, you need to give it at least 2 positional arguments, whereas you gave it one positional and one keyword argument. (The third implicit arg is self.)
So you need to pass the method a function and an iterable along with the callback.
P.S. that is a pretty useless error message, isn't it?
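For illustration, a hedged sketch of what a corrected call might look like (work and processor are hypothetical; on Python 2 a bound method like self.capi can't be pickled, hence the module-level wrapper):

import multiprocessing

def work(roi_rgb):
    # module-level function, so the pool can pickle it;
    # processor stands in for whatever object owns capi
    return processor.capi(roi_rgb, "", False)

results = []
pool = multiprocessing.Pool(None)
# func, iterable, then the callback, which receives the whole result list
r = pool.map_async(work, [roi_rgb], callback=results.extend)
r.wait()
pool.close()
pool.join()
# results now holds the single return value of capi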

mapping functions with *args using lambda

Okay this one is confusing. My old piece of code has something like
map(lambda x:x.func1(arg1), other_args_to_be_mapped)
Now I would like to change arg1 into *args, while other_args_to_be_mapped stays unchanged.
In func1, the length of the arguments will be checked to choose between different operations. My questions are:
1) Which length will be checked: arg1's or other_args_to_be_mapped's?
2) In func1, how should I set up the default? It was like
def func1(arg1=something)
but now, with potentially multiple arguments, I don't know what to do with the initialization. I want to be able to do something like
def func1(args*=something, something_else)
Is that even possible?
If I understand your question correctly, you're looking for variable arguments. These can be mixed with fixed arguments, provided you obey a logical ordering (fixed arguments first, then keyword arguments or variable arguments).
For example, the following shows how to map to a function that takes one constant argument and a variable number of arguments. If you would like different behaviour, please provide a concrete example of what you are trying to accomplish.
import random

class Foo:
    def get_variable_parameters(self):
        return [1] if random.random() > .5 else [1, 2]

    def foo(self, arg, *args):
        print("Doing stuff with constant arg", arg)
        if len(args) == 1:
            print("Good", args)
        else:
            print("Bad", args)

list(map(lambda x: x.foo('Static Argument', *x.get_variable_parameters()), [Foo(), Foo(), Foo()]))
We don't know how many arguments are going to be passed to foo (in this trivial case, it's one or two), but the "*" notation accepts any number of objects to be passed.
Note I've encapsulated map in list so that it gets evaluated, as in Python 3 it returns a lazy iterator. A list comprehension may be more idiomatic in Python. Also, don't forget you can always use a simple for loop - an obfuscated or complex map call is far less pythonic than a clear (but several-line) for loop, imo.
If, rather, you're trying to combine multiple arguments in a map call, I would recommend using the same variable argument strategy with the zip function, e.g.,
def foo(a, *b): ...

map(lambda x: foo(x[0], *x[1]), zip(['a', 'b'], [[1], [1, 2]]))
In this case, foo will get called first as foo('a', 1), and then as foo('b', 1, 2).
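If the lambda-plus-indexing feels noisy, itertools.starmap does the same tuple unpacking for you (a small equivalent sketch):

from itertools import starmap

def foo(a, *b):
    print(a, b)

# each tuple is unpacked into foo's arguments:
# foo('a', 1), then foo('b', 1, 2)
list(starmap(foo, [('a', 1), ('b', 1, 2)]))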

How can I pass "whole" CLI arguments to a function in Python

I'm getting stuck with this
I have a Python file which is imported elsewhere as a module, in order to use some functions it provides. I'm trying to find a way to call it from the CLI, giving it 0 or 5 arguments.
def simulate(index, sourcefile, temperature_file, save=0, outfile='fig.png'):
    (...)
    # do calculations and spit out a nice graph file

if __name__ == '__main__':
    if len(sys.argv) == 6:
        # ugly code alert
        simulate(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5])
    else:
        (...)
        # do some other things and don't bother me
I was wondering if there's a clean way to pass all but first argument to a function.
I tried simulate(sys.argv[1:]), but it passes a single object (a list), and since the simulate function expects at least 3 arguments, it doesn't work: TypeError: simulate() takes at least 3 arguments (1 given)
I also tried simulate(itertools.chain(sys.argv[1:])), with the same result.
Since this file is imported elsewhere as a module and this function is called many times, it seems a bad idea to change the function's signature to receive a single argument.
simulate(*sys.argv[1:])
See "Unpacking Argument Lists" in the tutorial
What you want to use is called "Packing/Unpacking" in Python:
foo(*sys.argv)
See: http://en.wikibooks.org/wiki/Python_Programming/Tuples#Packing_and_Unpacking
If you want "all but first argument":
foo(*sys.argv[1:])
This is called "slicing". See: http://docs.python.org/2.3/whatsnew/section-slices.html
