Running dataframe apply in parallel with partial - python

I'm following the answer from this question: pandas multiprocessing apply
Usually when I run a function on rows in pandas, I do something like this
dataframe.apply(lambda row: process(row.attr1, row.attr2, ...))
...
def process(attr1, attr2, ...):
...
But I want to run this function in parallel across multiple processes. So I implemented parallelize_on_rows from the question linked above. However, that solution works because the function passed in doesn't take any extra parameters. For functions with parameters I tried to use partials, but I can't figure out how to create a partial whose remaining argument is the row itself, which the lambda would normally supply.
Here is my code
def parallelize_function_on_df(self, data, func, num_of_processes=5):
    # refers to the data being split across array sections
    data_split = np.array_split(data, num_of_processes)
    # map a specific function to the array sections
    pool = Pool(num_of_processes)
    data = pd.concat(pool.map(func, data_split))
    # must call close before join, research why
    pool.close()
    pool.join()
    return data

def run_on_df_subset(self, func, data_subset):
    return data_subset.apply(func, axis=1)

def parallelize_on_rows(self, data, func, num_of_processes=5):
    return self.parallelize_function_on_df(data, partial(self.run_on_df_subset, func), num_of_processes)
def mass_download(some_sql):
    download_table_df = pd.read_sql(some_sql, con=MYSQL.CONNECTION)
    processed_data = {}
    custom_option = True
    process_row_partial = partial(self.process_candidate_row_parallel, processed_data, custom_option)
    parallelize_on_rows(download_table_df, process_row_partial)

def process_candidate_row_parallel(row, processed_data, custom_option=False):
    if row['some_attr'] in processed_data.keys() and processed_data[row['some_attr']] == 'download_successful' and custom_option:
        do_some_other_processing()
    download_single_file(row['some_attr1'], row['some_attr2'], processed_data)
So this doesn't work because, as I said, the row attributes aren't actually being passed to the worker: my partial fixes the other arguments, but the row never reaches the function in the right position. How can I achieve this?
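One possible fix (a sketch, not from the original post): DataFrame.apply passes the row as the single positional argument, so bind the fixed values as keyword arguments in the partial and leave the row as the first parameter. The names below (process_candidate_row_parallel, parallelize_on_rows, download_table_df, processed_data, custom_option) mirror the question's code; everything else is assumed.

from functools import partial

# Sketch only: the fixed arguments are bound by keyword, so apply(...) can
# still supply the row as the first positional argument.
process_row_partial = partial(process_candidate_row_parallel,
                              processed_data=processed_data,
                              custom_option=custom_option)

# Inside run_on_df_subset, data_subset.apply(process_row_partial, axis=1)
# now calls process_candidate_row_parallel(row, processed_data=..., custom_option=...)
parallelize_on_rows(download_table_df, process_row_partial)

Note that with multiprocessing each worker receives its own pickled copy of processed_data, so mutations made inside the workers will not be visible in the parent process; that is a separate issue from how the partial is built.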

Related

How can I systematically reuse the results of delayed functions in Dask?

I am working on building a computation graph with Dask. Some of the intermediate values will be used multiple times, but I would like those calculations to only run once. I must be making a trivial mistake, because that's not what happens. Here is a minimal example:
In [1]: import dask
dask.__version__
Out [1]: '1.0.0'
In [2]: class SumGenerator(object):
    def __init__(self):
        self.sources = []
    def register(self, source):
        self.sources += [source]
    def generate(self):
        return dask.delayed(sum)([s() for s in self.sources])
In [3]: sg = SumGenerator()
In [4]: @dask.delayed
def source1():
    return 1.

@dask.delayed
def source2():
    return 2.

@dask.delayed
def source3():
    return 3.
In [5]: sg.register(source1)
sg.register(source1)
sg.register(source2)
sg.register(source3)
In [6]: sg.generate().visualize()
Sadly I am unable to post the resulting graph image, but basically I see two separate nodes for the function source1 that was registered twice. Therefore the function is called twice. I would rather like to have it called once, the result remembered and added twice in the sum. What would be the correct way to do that?
You need to call the dask.delayed decorator with the pure=True argument.
From the dask delayed docs
delayed also accepts an optional keyword pure. If False, then subsequent calls will always produce a different Delayed
If you know a function is pure (output only depends on the input, with no global state), then you can set pure=True.
So using that
import dask

class SumGenerator(object):
    def __init__(self):
        self.sources = []
    def register(self, source):
        self.sources += [source]
    def generate(self):
        return dask.delayed(sum)([s() for s in self.sources])

@dask.delayed(pure=True)
def source1():
    return 1.

@dask.delayed(pure=True)
def source2():
    return 2.

@dask.delayed(pure=True)
def source3():
    return 3.

sg = SumGenerator()
sg.register(source1)
sg.register(source1)
sg.register(source2)
sg.register(source3)
sg.generate().visualize()
Output and Graph
Using print(dask.compute(sg.generate())) gives (7.0,), which is the same result as your version but without the extra node, as seen in the image.
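A quick way to see why the duplicate node disappears (a minimal check, assuming dask is installed): with pure=True the key of a Delayed object is derived from the function and its arguments, so two identical calls share a key and the scheduler merges them into one graph node.

import dask

@dask.delayed(pure=True)
def source1():
    return 1.

# Both calls hash to the same key, so the graph contains a single node.
a, b = source1(), source1()
print(a.key == b.key)   # True with pure=True; False with the default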

Passing second argument to function in pool.map

I have a pandas dataframe with many rows, and I am using multiprocessing to process grouped tables from this dataframe concurrently. It works fine, but I have a problem passing in a second parameter: I have tried to pass both arguments as a tuple, but it doesn't work. My code is as follows:
I want to also pass in the parameter "col" to the function "process_table"
for col in cols:
    tables = df.groupby('test')
    p = Pool()
    lines = p.map(process_table, tables)
    p.close()
    p.join()

def process_table(t):
    # Bunch of processing to create a line for matplotlib
    return line
You could do this; it takes an iterable and expands it into individual arguments:
def expand(x):
    return process_table(*x)

p.map(expand, tables)
You might be tempted to do this:
p.map(lambda x: process_table(*x), table) # DOES NOT WORK
But it won't work because lambdas are unpickleable (if you don't know what this means, trust me).
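To tie this back to the question (a sketch, not part of the original answer): build an iterable of (group, col) tuples so that expand can unpack both arguments, or use Pool.starmap (Python 3), which does the unpacking for you. The sample dataframe and the placeholder body of process_table are made up for illustration.

import pandas as pd
from multiprocessing import Pool

def process_table(t, col):
    # placeholder for the real processing; just summarise the chosen column
    return t[col].sum()

def expand(args):
    # unpack the (group, col) tuple into individual arguments
    return process_table(*args)

if __name__ == '__main__':
    df = pd.DataFrame({'test': [1, 1, 2, 2], 'value': [10, 20, 30, 40]})
    col = 'value'
    jobs = [(group, col) for _, group in df.groupby('test')]

    with Pool() as p:
        lines = p.map(expand, jobs)
        # Pool.starmap would also work and removes the need for expand:
        # lines = p.starmap(process_table, jobs)
    print(lines)   # [30, 70]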

python timer decorator - function takes effect on the original dataset which I put in the arguments

I'm relatively new to Python decorators.
I have this decorator function.
import time

def myTimer(func):
    def wrapper(*args, **kargs):
        t1 = time.time()
        result = func(*args, **kargs)
        t2 = time.time() - t1
        print('Execution Time (function : {}) : {} sec'.format(func.__name__, t2))
        return result
    return wrapper
This is just a timer function.
And I have a function which adds a column based on other columns.
@myTimer
def createID(dat):
    dat['new'] = dat.apply(lambda x: '_'.join(map(str, x[4:8])), axis=1)
    return dat
This generates a new column whose values are just other column values joined with a '_' separator.
Now, if I define the two functions above and run below,
tdat2 = createID(tdat)
And then tdat2 is returned correctly, but the change takes effect on tdat (the original dataset) too.
I mean, tdat has 30 columns in the first place and tdat2 should have 31 columns, which is fine, but tdat ends up with the new column as well.
Is there any way I can fix this?
I have tried the version below and it works just fine for me, but I want to keep the argument and return value named the same ('dat') because of code conventions, etc.
@myTimer
def createID2(dat):
    result = dat.copy()
    result['new'] = result.apply(lambda x: '_'.join(map(str, x[4:8])), axis=1)
    return result
Thanks in advance.
Some Notes
Since createID is not defined on a class, it is a function rather than a method. It's a black box: it takes an input, does something, and returns an output. Pythonically, functions are written in lowercase with underscores, e.g. create_id.
my_timer() is also a function; it wraps the function it decorates, i.e. create_id(). Your decorator really doesn't do anything to the wrapped function except print something (a side effect).
So whatever happens inside the decorated function is not influenced by your decorator. All it does is time how fast the function call runs.
The mutation problem you are describing is a pandas issue (see docs on View vs Copy). You have resolved it with the .copy() method.
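A minimal sketch (not from the original answer) showing that the decorator plays no part: calling the undecorated function mutates the original dataframe in exactly the same way, because the function receives a reference to the same object.

import pandas as pd

def create_id(dat):
    # assigning a column mutates the DataFrame that was passed in
    dat['new'] = dat['a'].astype(str) + '_' + dat['b'].astype(str)
    return dat

tdat = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
tdat2 = create_id(tdat)           # no decorator involved
print('new' in tdat.columns)      # True: the original gained the column too
print(tdat2 is tdat)              # True: both names refer to the same object

# copying first, as in createID2, leaves the original untouched
tdat3 = create_id(tdat[['a', 'b']].copy())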

Functional pipeline using python with advance operators

I am following the PyData talk at https://youtu.be/R1em4C0oXo8, where the presenter shows a library for pipelining called yamal. This library is not open source, so, as a way of learning FP in Python, I tried to replicate the basics of that library.
In a nutshell, you build a series of pure functions in Python (f1, f2, f3, etc.) and create a list of them as follows:
pipeline = [f1, f2, f3, f4]
Then you can apply the function run_pipeline, and the result will be the composition:
f4(f3(f2(f1())))
The requirements on the functions are that each has one return value and, except for f1, one input.
This part is easy to implement; I did it by composing the functions:
def run_pipeline(pipeline):
    get_data, *rest_of_steps = pipeline
    def compose(x):
        for f in rest_of_steps:
            y = f(x)
            x = y
        return x
    data = get_data()
    return compose(data)
The talk shows a more advanced use of this abstraction: he defines the "operators" fork and reducer. These "operators" allow running pipelines like the following:
pipeline1 = [ f1, fork(f2, f3), f4 ]
which is equivalent to: [ f4(f2(f1())), f4(f3(f1())) ]
and
pipeline2 = [ f1, fork(f2, f3), f4, reducer(f5) ]
which is equivalent to f5([f4(f3(f1())), f4(f2(f1()))]).
I tried to solve this using functional programming, but I simply can't. I don't know if fork and reducer are decorators (and if so, how do I pass in the list of following functions?). I don't know whether I should transform this list into a graph using objects, or coroutines (maybe all of this is nonsense). I am simply, utterly confused.
Could someone help me with how to frame this using Python and functional programming?
NOTE: In the video he talks about observers or executors; for this exercise I don't care about them.
Although this library is intended to facilitate FP in Python, it's not clear whether the library itself should be written using lots of FP.
This is one way to implement it, using classes (based on the list type) to tell the pipe function whether it needs to fork or reduce, and whether it is dealing with a single data item or a list of items.
This makes some limited use of FP style techniques such as the recursive calls to apply_func (allowing multiple forks within a pipeline).
from functools import reduce

class Forked(list):
    """ Contains a list of data after forking """

class Fork(list):
    """ Contains a list of functions for forking """

class Reducer(object):
    """ Contains a function for reducing forked data """
    def __init__(self, func):
        self.func = func

def fork(*funcs):
    return Fork(funcs)

def reducer(func):
    """ Return a reducer form based on a function that accepts a
    Forked list as its first argument """
    return Reducer(func)

def apply_func(data, func):
    """ Apply a function to data which may be forked """
    if isinstance(data, Forked):
        return Forked(apply_func(datum, func) for datum in data)
    else:
        return func(data)

def apply_form(data, form):
    """ Apply a pipeline form (which may be a function, fork, or reducer)
    to the data """
    if callable(form):
        return apply_func(data, form)
    elif isinstance(form, Fork):
        return Forked(apply_func(data, func) for func in form)
    elif isinstance(form, Reducer):
        return form.func(data)

def pipe(data, *forms):
    """ Apply a pipeline of function forms to data """
    return reduce(apply_form, forms, data)
Examples of this in use:
def double(x): return x * 2
def inc(x): return x + 1
def dec(x): return x - 1
def mult(L): return L[0] * L[1]

print(pipe(10, inc, double))                            # 22
print(pipe(10, fork(dec, inc), double))                 # [18, 22]
print(pipe(10, fork(dec, inc), double, reducer(mult)))  # 396
EDIT: This can also be simplified a bit further by making fork a function that returns a function and reducer a class that creates objects mimicking a function. Then the separate Fork and Reducer classes are no longer needed.
from functools import reduce

class Forked(list):
    """ Contains a list of data after forking """

def fork(*funcs):
    """ Return a function that will take data and output a forked
    list of results of putting the data through several functions """
    def inner(data):
        return Forked(apply_form(data, func) for func in funcs)
    return inner

class reducer(object):
    def __init__(self, func):
        self.func = func
    def __call__(self, data):
        return self.func(data)

def apply_form(data, form):
    """ Apply a function or reducer to data which may be forked """
    if isinstance(data, Forked) and not isinstance(form, reducer):
        return Forked(apply_form(datum, form) for datum in data)
    else:
        return form(data)

def pipe(data, *forms):
    """ Apply a pipeline of function forms to data """
    return reduce(apply_form, forms, data)
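For completeness, the simplified version gives the same results on the earlier examples (a quick check, not part of the original answer):

def double(x): return x * 2
def inc(x): return x + 1
def dec(x): return x - 1
def mult(L): return L[0] * L[1]

print(pipe(10, inc, double))                            # 22
print(pipe(10, fork(dec, inc), double))                 # [18, 22]
print(pipe(10, fork(dec, inc), double, reducer(mult)))  # 396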

Python - multiple functions - output of one to the next

I know this is super basic and I have been searching everywhere, but I am still very confused by everything I'm seeing; I'm not sure of the best way to do this and am having a hard time wrapping my head around it.
I have a script with multiple functions. I would like the first function to pass its output to the second, then the second to pass its output to the third, etc. Each does its own step in an overall process on the starting dataset.
For example, very simplified and with bad names, but just to get the basic structure:
#!/usr/bin/python
# script called process.py
import sys

infile = sys.argv[1]

def function_one():
    # do things
    return function_one_output

def function_two():
    # take output from function_one, and do more things
    return function_two_output

def function_three():
    # take output from function_two, do more things
    return/print function_three_output
I want this to run as one script and print the output / write to a new file or whatever, which I know how to do. I'm just unclear on how to pass the intermediate outputs of each function to the next.
infile -> function_one -> (intermediate1) -> function_two -> (intermediate2) -> function_three -> final result/outfile
I know I need to use return, but I am unsure how to call this at the end to get my final output
Individually?
function_one(infile)
function_two()
function_three()
or within each other?
function_three(function_two(function_one(infile)))
or within the actual function?
def function_one():
    # do things
    return function_one_output

def function_two():
    input_for_this_function = function_one()
    # etc etc etc
Thank you, friends. I am overcomplicating this and need a very simple way to understand it.
You could define a data streaming helper function
from functools import reduce

def flow(seed, *funcs):
    return reduce(lambda arg, func: func(arg), funcs, seed)

flow(infile, function_one, function_two, function_three)

# for example
flow('HELLO', str.lower, str.capitalize, str.swapcase)
# returns 'hELLO'
edit
I would now suggest that a more "pythonic" way to implement the flow function above is:
def flow(seed, *funcs):
    for func in funcs:
        seed = func(seed)
    return seed
As ZdaR mentioned, you can run each function and store the result in a variable then pass it to the next function.
def function_one(file):
    # do things on file
    return function_one_output

def function_two(myData):
    # do things on myData
    return function_two_output

def function_three(moreData):
    # do more things on moreData
    return/print function_three_output

def Main():
    firstData = function_one(infile)
    secondData = function_two(firstData)
    function_three(secondData)
This is assuming your function_three would write to a file or doesn't need to return anything. Another method, if these three functions will always run together, is to call them inside function_three. For example...
def function_three(file):
    firstStep = function_one(file)
    secondStep = function_two(firstStep)
    # do things on secondStep
    return/print to file
Then all you have to do is call function_three in your main and pass it the file.
For safety, readability and debugging ease, I would temporarily store the results of each function.
def function_one():
    # do things
    return function_one_output

def function_two(function_one_output):
    # take function_one_output and do more things
    return function_two_output

def function_three(function_two_output):
    # take function_two_output and do more things
    return/print function_three_output

result_one = function_one()
result_two = function_two(result_one)
result_three = function_three(result_two)
The added benefit here is that you can then check that each function is correct. If the end result isn't what you expected, just print the results you're getting or perform some other check to verify them. (Also, if you're running in the interpreter, the variables will stay in the namespace after the script ends, so you can test the results interactively.)
result_one = function_one()
print result_one
result_two = function_two(result_one)
print result_two
result_three = function_three(result_two)
print result_three
Note: I used multiple result variables, but as PM 2Ring notes in a comment, you could just reuse the name result over and over. That would be particularly helpful if the results were large objects.
It's always better (for readability, testability and maintainability) to keep your functions as decoupled as possible, and to write them so that the output depends only on the input whenever possible.
So in your case, the best way is to write each function independently, i.e.:
def function_one(arg):
    do_something()
    return function_one_result

def function_two(arg):
    do_something_else()
    return function_two_result

def function_three(arg):
    do_yet_something_else()
    return function_three_result
Once you're there, you can of course directly chain the calls:
result = function_three(function_two(function_one(arg)))
but you can also use intermediate variables and try/except blocks if needed for logging / debugging / error handling etc:
r1 = function_one(arg)
logger.debug("function_one returned %s", r1)
try:
    r2 = function_two(r1)
except SomePossibleException as e:
    logger.exception("function_two raised %s for %s", e, r1)
    # either return, re-raise, ask the user what to do, etc.
    return 42  # when in doubt, always return 42 !
else:
    r3 = function_three(r2)
    print "Yay ! result is %s" % r3
As an extra bonus, you can now reuse these three functions anywhere, each on its own and in any order.
NB : of course there ARE cases where it just makes sense to call a function from another function... Like, if you end up writing:
result = function_three(function_two(function_one(arg)))
everywhere in your code AND it's not an accidental repetition, it might be time to wrap the whole in a single function:
def call_them_all(arg):
    return function_three(function_two(function_one(arg)))

Note that in this case it might be better to decompose the calls, as you'll find out when you have to debug it...
I'd do it this way:
def function_one(x):
    # do things
    output = x ** 1
    return output

def function_two(x):
    output = x ** 2
    return output

def function_three(x):
    output = x ** 3
    return output
Note that I have modified the functions to accept a single argument, x, and added a basic operation to each.
This has the advantage that each function is independent of the others (loosely coupled) which allows them to be reused in other ways. In the example above, function_two() returns the square of its argument, and function_three() the cube of its argument. Each can be called independently from elsewhere in your code, without being entangled in some hardcoded call chain such as you would have if called one function from another.
You can still call them like this:
>>> x = function_one(3)
>>> x
3
>>> x = function_two(x)
>>> x
9
>>> x = function_three(x)
>>> x
729
which lends itself to error checking, as others have pointed out.
Or like this:
>>> function_three(function_two(function_one(2)))
64
if you are sure that it's safe to do so.
And if you ever wanted to calculate the square or cube of a number, you can call function_two() or function_three() directly (but, of course, you would name the functions appropriately).
With d6tflow you can easily chain together complex data flows and execute them. You can quickly load input and output data for each task. It makes your workflow very clear and intuitive.
import d6tflow

class Function_one(d6tflow.tasks.TaskCache):
    def run(self):
        function_one_output = do_things()
        self.save(function_one_output)  # instead of return

@d6tflow.requires(Function_one)
class Function_two(d6tflow.tasks.TaskCache):
    def run(self):
        output_from_function_one = self.inputLoad()  # load function input
        function_two_output = do_more_things()
        self.save(function_two_output)

@d6tflow.requires(Function_two)
class Function_three(d6tflow.tasks.TaskCache):
    def run(self):
        output_from_function_two = self.inputLoad()
        function_three_output = do_more_things()
        self.save(function_three_output)

d6tflow.run(Function_three())  # executes all functions
function_one_output = Function_one().outputLoad()  # get function output
function_three_output = Function_three().outputLoad()
It has many more useful features like parameter management, persistence, intelligent workflow management. See https://d6tflow.readthedocs.io/en/latest/
This way, function_three(function_two(function_one(infile))) would be the best: you do not need global variables, and each function is completely independent of the others.
Edited to add:
I would also say that function_three should not print anything; if you want to print the returned results, use:
print function_three(function_two(function_one(infile)))
or something like:
output = function_three(function_two(function_one(infile)))
print output
Use parameters to pass the values:
def function1():
    foo = do_stuff()
    return function2(foo)

def function2(foo):
    bar = do_more_stuff(foo)
    return function3(bar)

def function3(bar):
    baz = do_even_more_stuff(bar)
    return baz

def main():
    thing = function1()
    print thing
