Passing second argument to function in pool.map - python

I have a pandas DataFrame with many rows, and I am using multiprocessing to process grouped tables from this DataFrame concurrently. It works fine, but I have a problem passing in a second parameter: I want to also pass the parameter "col" to the function "process_table", and passing both arguments as a tuple doesn't work. My code is as follows:
for col in cols:
    tables = df.groupby('test')
    p = Pool()
    lines = p.map(process_table, tables)
    p.close()
    p.join()

def process_table(t):
    # Bunch of processing to create a line for matplotlib
    return line

You could do this; it takes an iterable and expands it into individual arguments:

def expand(x):
    return process_table(*x)

p.map(expand, tables)
You might be tempted to do this:
p.map(lambda x: process_table(*x), table) # DOES NOT WORK
But it won't work because lambdas are unpickleable (if you don't know what this means, trust me).
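If the second value is the same for every table, another option (not from the original answer) is functools.partial, which binds col up front and keeps the mapped callable picklable. A minimal sketch, with a stand-in process_table body and toy data in place of the real groups:

from functools import partial
from multiprocessing import Pool

def process_table(col, t):
    # stand-in body; the real function builds a matplotlib line
    return (col, len(t))

if __name__ == '__main__':
    tables = [[1, 2, 3], [4, 5]]  # stand-in for the groups from df.groupby('test')
    with Pool() as p:
        lines = p.map(partial(process_table, 'some_col'), tables)
    print(lines)  # [('some_col', 3), ('some_col', 2)]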

Related

Can't use values returned from Python Multiprocessing

I am using the multiprocessing technique in a method within a class (the class will eventually be imported into the main class to be executed), and the code segment is the following
result = []

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    data = [[1,2],[3,4],[5,6],[7,8]]
    result_m = [pool.starmap(some_function, data)]
    pool.join()
    pool.close()

    result.append(result_m[0])
    result.append(result_m[1])
    result.append(result_m[2])
    result.append(result_m[3])

    first = result[0]
    second = result[1]
    third = result[2]
    fourth = result[3]
Basically, the idea is to use multiprocessing to call another method named some_function() simultaneously in four processes, and then append the result from each process to a list named "result". However, when I run the code, it always says "list index out of range" for the last four lines. It seems that the results never got added to the list defined earlier. I am wondering why this is the case and what are some potential ways to fix this? Thank you!
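For reference, a minimal sketch of how a snippet like this is usually restructured (some_function here is just a stand-in): starmap already returns one result per input pair, and close should come before join.

import multiprocessing

def some_function(a, b):
    # stand-in body for the real work
    return a + b

if __name__ == '__main__':
    data = [[1, 2], [3, 4], [5, 6], [7, 8]]
    with multiprocessing.Pool(processes=4) as pool:
        result = pool.starmap(some_function, data)  # one result per input pair
    first, second, third, fourth = result
    print(first, second, third, fourth)  # 3 7 11 15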

Running dataframe apply in parallel with partial

I'm following the answer from this question: pandas multiprocessing apply
Usually when I run a function on rows in pandas, I do something like this
dataframe.apply(lambda row: process(row.attr1, row.attr2, ...))
...
def process(attr1, attr2, ...):
    ...
But I want to run this function multithreaded, so I implemented parallelize_on_rows from the question linked above. However, that solution works because the function passed in doesn't take extra parameters. For functions with parameters I tried to use partials, but I can't figure out how to create a partial for a function that also needs arguments taken from the row, which the lambda previously provided.
Here is my code
def parallelize_function_on_df(self, data, func, num_of_processes=5):
    # refers to the data being split across array sections
    data_split = np.array_split(data, num_of_processes)
    # map a specific function to the array sections
    pool = Pool(num_of_processes)
    data = pd.concat(pool.map(func, data_split))
    # must call close before join, research why
    pool.close()
    pool.join()
    return data

def run_on_df_subset(self, func, data_subset):
    return data_subset.apply(func, axis=1)

def parallelize_on_rows(self, data, func, num_of_processes=5):
    return self.parallelize_function_on_df(data, partial(self.run_on_df_subset, func), num_of_processes)

def mass_download(some_sql):
    download_table_df = pd.read_sql(some_sql, con=MYSQL.CONNECTION)
    processed_data = {}
    custom_option = True
    process_row_partial = partial(self.process_candidate_row_parallel, processed_data, custom_option)
    parallelize_on_rows(download_table_df, process_row_partial)

def process_candidate_row_parallel(row, processed_data, custom_option=False):
    if row['some_attr'] in processed_data.keys() and processed_data[row['some_attr']] == 'download_successful' and custom_option:
        do_some_other_processing()

    download_single_file(row['some_attr1'], row['some_attr2'], processed_data)
So this doesn't work because, as I said, the row attributes aren't actually being passed through; my partial only fixes the other arguments. How can I achieve this?
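One common fix, sketched here under the assumption that the signature can be reordered (this is not the original poster's code): partial binds the leading positional arguments, so the row has to arrive in the remaining slot, i.e. last.

from functools import partial
import pandas as pd

def process_candidate_row_parallel(processed_data, custom_option, row):
    # row is now the last parameter, so partial can bind the leading ones
    return (row['some_attr'], custom_option)

df = pd.DataFrame({'some_attr': ['a', 'b']})
process_row_partial = partial(process_candidate_row_parallel, {}, True)
print(df.apply(process_row_partial, axis=1).tolist())  # [('a', True), ('b', True)]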

python timer decorator - function takes effect on the original dataset which I put in the arguments

I'm relatively new to python decorator.
I have this decorator function.
import time

def myTimer(func):
    def wrapper(*args, **kargs):
        t1 = time.time()
        result = func(*args, **kargs)
        t2 = time.time() - t1
        print('Execution Time (function : {}) : {} sec'.format(func.__name__, t2))
        return result
    return wrapper
This is just a timer function.
And I have a method which adds a column based on other columns.
@myTimer
def createID(dat):
    dat['new'] = dat.apply(lambda x: '_'.join(map(str, x[4:8])), axis = 1)
    return dat
This generates a new column whose values are the values of some other columns joined with a '_' separator.
Now, if I define the two functions above and run below,
tdat2 = createID(tdat)
Then tdat2 is returned correctly, but the change also takes effect on tdat (the original dataset). I mean, tdat has 30 columns in the first place and tdat2 should have 31 columns, which is fine, but tdat now has the new column, too.
Is there any way I can fix this?
I have tried the version below and it works just fine for me, but I want the argument and the return value to use the same name ('dat') because of code conventions, etc.
@myTimer
def createID2(dat):
    result = dat.copy()
    result['new'] = result.apply(lambda x: '_'.join(map(str, x[4:8])), axis = 1)
    return result
Thanks in advance.
Some Notes
Since you don't have a class, createID is called a function. It's a black box: it takes in an input, does something and returns an output. Pythonically, functions are written in lowercase with underscores, e.g. create_id.
my_timer() is also a function that wraps the function it decorates, i.e. create_id(). Your decorator really doesn't do anything to your wrapped function except print something (a side effect).
So whatever happens inside the decorated function is not influenced by your decorator. All it does is time how fast the function call runs.
The mutation problem you are describing is a pandas issue (see docs on View vs Copy). You have resolved it with the .copy() method.
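For completeness, a small sketch of the decorator in the snake_case style the notes suggest, with functools.wraps added (an addition beyond the original) so the decorated function keeps its name and docstring:

import functools
import time

def my_timer(func):
    @functools.wraps(func)  # copies func's metadata (name, docstring) onto wrapper
    def wrapper(*args, **kwargs):
        t1 = time.time()
        result = func(*args, **kwargs)
        print('Execution Time (function : {}) : {} sec'.format(func.__name__, time.time() - t1))
        return result
    return wrapper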

Currying in inversed order in python

Suppose I have a function like this:
from toolz.curried import *
@curry
def foo(x, y):
    print(x, y)
Then I can call:
foo(1,2)
foo(1)(2)
Both return the same as expected.
However, I would like to do something like this:
@curry.inverse  # hypothetical
def bar(*args, last):
    print(*args, last)
bar(1,2,3)(last)
The idea behind this is that I would like to pre-configure a function and then put it in a pipe like this:
pipe(data,
    f1,  # another function
    bar(1, 2, 3)  # unknown number of arguments
)
Then, bar(1,2,3)(data) would be called as a part of the pipe. However, I don't know how to do this. Any ideas? Thank you very much!
Edit:
A more illustrative example was asked for. Thus, here it comes:
import pandas as pd
from toolz.curried import *
df = pd.DataFrame(data)
def filter_columns(*args, df):
    return df[[*args]]

pipe(df,
    transformation_1,
    transformation_2,
    filter_columns("date", "temperature")
)
As you can see, the DataFrame is piped through the functions, and filter_columns is one of them. However, the function is pre-configured and returns a function that only takes a DataFrame, similar to a decorator. The same behaviour could be achieved with this:
def filter_columns(*args):
    def f(df):
        return df[[*args]]
    return f
However, I would always have to run two calls then, e.g. filter_columns()(df), and that is what I would like to avoid.
Well, I am unfamiliar with the toolz module, but it looks like there is no easy way to curry a function with an arbitrary number of arguments, so let's try something else.
First, as an alternative to
def filter_columns(*args):
    def f(df):
        return df[*args]
    return f
(and by the way, df[*args] is a syntax error), to avoid filter_columns()(data) you can just grab the last element of args and use slice notation to grab everything else, for example:
def filter_columns(*argv):
    df, columns = argv[-1], argv[:-1]
    return df[list(columns)]  # columns is a tuple; pandas needs a list of column names here
And use it as filter_columns(df), filter_columns("date", "temperature", df), etc.
Then use functools.partial to construct your new, well, partially applied, filter and build your pipe, for example:
from functools import partial
from toolz.curried import pipe  # always be explicit with your imports; the last thing you want is to import something you didn't intend to that overwrites something else you use

pipe(df,
    transformation_1,
    transformation_2,
    partial(filter_columns, "date", "temperature")
)
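Put together, a self-contained sketch of that approach (the toy DataFrame and the identity lambdas standing in for transformation_1 and transformation_2 are placeholders):

from functools import partial
import pandas as pd
from toolz.curried import pipe

def filter_columns(*argv):
    df, columns = argv[-1], argv[:-1]
    return df[list(columns)]

df = pd.DataFrame({'date': ['2021-01-01'], 'temperature': [3.2], 'humidity': [0.8]})

result = pipe(df,
    lambda d: d,  # stand-in for transformation_1
    lambda d: d,  # stand-in for transformation_2
    partial(filter_columns, 'date', 'temperature'),
)
print(list(result.columns))  # ['date', 'temperature']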

passing optional dataframe parameter in python

Rather than explicitly specifying the DataFrame columns in the code below, I'm trying to give the option of passing in the DataFrame itself, without much success.
The code below gives a "ValueError: Wrong number of dimensions" error.
I've tried another couple of ideas but they all lead to errors of one form or another.
Apart from this issue, when the parameters are passed as explicit DataFrame columns, p as a single column, and q as a list of columns, the code works as desired. Is there a clever (or indeed any) way of passing in the data frame so the columns can be assigned to it implicitly?
def cdf(p, q=[], datafr=None):
    if datafr != None:
        p = datafr[p]
        for i in range(len(q)):
            q[i] = datafr[q[i]]
    ...
    (calculate conditional probability tables for p|q)
to summarize:
current usage:
cdf(df['var1'], [df['var2'], df['var3']])
desired usage:
cdf('var1', ['var2', 'var3'], datafr=df)
Change if datafr != None: to if datafr is not None:
Pandas doesn't know which value in the DataFrame you are trying to compare to None, so it throws an error. is checks whether datafr and None point to the same object, which is a stricter identity check. See this explanation.
Additional tips:
Python iterates over lists directly:

# change this
for i in range(len(q)):
    q[i] = datafr[q[i]]

# to this:
q = [datafr[col] for col in q]
If q is a required parameter, don't give it a default of q=[] when defining your function. If it is an optional parameter, ignore me.
Python can use position to match the arguments passed in the function call with the parameters in the definition.
cdf('var1', ['var2', 'var3'], datafr=df)
#can be written as:
cdf('var1', ['var2', 'var3'], df)
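Combining these tips, a minimal sketch of what the signature could look like (the body is a stand-in for the actual conditional-probability calculation):

import pandas as pd

def cdf(p, q=None, datafr=None):
    q = list(q) if q is not None else []
    if datafr is not None:
        # names were passed, so look the columns up in the frame
        p = datafr[p]
        q = [datafr[col] for col in q]
    # stand-in for calculating the conditional probability tables for p|q
    return p, q

df = pd.DataFrame({'var1': [0, 1], 'var2': [1, 1], 'var3': [0, 0]})
cdf('var1', ['var2', 'var3'], datafr=df)   # desired usage
cdf(df['var1'], [df['var2'], df['var3']])  # current usage still works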
