parallelize dataframe splitting and processing - python

Problem statement: How do I parallelize a for loop that splits a pandas dataframe into two parts, applies a function to each part (also in parallel), and stores the combined results from the function in a list to use after the loop is over?
For context, I am trying to parallelize my decision tree implementation. Many of the answers I have previously seen related to this question need the result of the applied function to be a dataframe, and the results are simply concatenated into one big dataframe. I believe this question is slightly more general.
For example, this is the code I would like to parallelize:
# suppose we have some dataframe given to us
df = pd.DataFrame(....)
computation_results = []
# I would like to parallelize this whole loop and store the results of the
# computations in computation_results. min_rows and total_rows are known
# integers.
for i in range(min_rows, total_rows - min_rows + 1):
    df_left = df.loc[range(0, i), :].copy()
    df_right = df.loc[range(i, total_rows), :].copy()
    # foo is a function that takes in a dataframe and returns some
    # result that has no pointers to the passed dataframe. The following
    # two function calls should also be parallelized.
    left_results = foo(df_left)
    right_results = foo(df_right)
    # combine the results with some function and append that combination
    # to a list. The order of the results in the list does not matter.
    computation_results.append(combine_results(left_results, right_results))
# parallelization is not needed for the following function and the loop is over
use_computation_results(computation_results)

Check the example in https://docs.python.org/3.3/library/multiprocessing.html#using-a-pool-of-workers.
So in your case:
from multiprocessing import Pool

computation_results = []
with Pool(processes=2) as pool:  # start 2 worker processes
    for i in range(min_rows, total_rows - min_rows + 1):
        df_left = df.loc[range(0, i), :].copy()
        call_left = pool.apply_async(foo, (df_left,))    # evaluate "foo(df_left)" asynchronously
        df_right = df.loc[range(i, total_rows), :].copy()
        call_right = pool.apply_async(foo, (df_right,))  # evaluate "foo(df_right)" asynchronously
        left_results = call_left.get()                   # wait for and get the left result
        right_results = call_right.get()                 # wait for and get the right result
        computation_results.append(combine_results(left_results, right_results))
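If waiting on each pair inside the loop becomes a bottleneck (the next split is only submitted after the current pair finishes), one alternative is to submit every split up front and collect the results afterwards. This is only a sketch, assuming foo and combine_results are top-level, picklable functions:

from multiprocessing import Pool

with Pool(processes=4) as pool:
    # submit all left/right computations first
    pending = []
    for i in range(min_rows, total_rows - min_rows + 1):
        call_left = pool.apply_async(foo, (df.loc[range(0, i), :].copy(),))
        call_right = pool.apply_async(foo, (df.loc[range(i, total_rows), :].copy(),))
        pending.append((call_left, call_right))
    # then collect and combine them; the order of the results does not matter
    computation_results = [
        combine_results(left.get(), right.get()) for left, right in pending
    ]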

Related

faster way to run a for loop for a very large dataframe list

I am using two nested for loops to calculate a value using combinations of elements in a list of dataframes. The list consists of a large number of dataframes, and using two for loops takes a considerable amount of time.
Is there a way I can do the operation faster?
The functions I refer to with dummy names are the ones where I calculate the results.
My code looks like this:
conf_list = []
for tr in range(len(trajectories)):
    df_1 = trajectories[tr]
    if len(df_1) == 0:
        continue
    for tt in range(len(trajectories)):
        df_2 = trajectories[tt]
        if len(df_2) == 0:
            continue
        if df_1.equals(df_2) or df_1['time'].iloc[0] > df_2['time'].iloc[-1] or df_2['time'].iloc[0] > df_1['time'].iloc[-1]:
            continue
        df_temp = cartesian_product_basic(df_1, df_2)
        flg, df_temp = another_function(df_temp)
        if flg == 0:
            continue
        flg_h = some_other_function(df_temp)
        if flg_h == 1:
            conf_list.append(1)
My input list consists of around 5000 dataframes (each having several hundred rows) that look like this:

id  x  y  z  time
1   5  7  2  5

What I do is get the cartesian product of combinations of two dataframes, and for each couple I calculate another value 'c'. If this value c meets a condition, I add an element to my conf_list so that I can get the final number of couples meeting the requirement.
For further info:
a_function (cartesian_product_basic in the code above) gets the cartesian product of two dataframes.
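The question doesn't show the cartesian product helper itself; a plausible minimal sketch of such a cross join (an assumption, not the asker's actual code) could look like this:

import pandas as pd

def cartesian_product_basic(left, right):
    # cross join via a temporary constant key; the _x/_y suffixes match
    # the column names used in another_function and some_other_function
    return (left.assign(_key=1)
                .merge(right.assign(_key=1), on='_key', suffixes=('_x', '_y'))
                .drop('_key', axis=1))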
another_function looks like this:
# nwh appears to be an alias for np.where (same signature)
def another_function(df_temp):
    df_temp['z_dif'] = nwh((df_temp['time_x'] == df_temp['time_y']),
                           abs(df_temp['z_x'] - df_temp['z_y']), np.nan)
    df_temp = df_temp.dropna()
    df_temp['vert_conf'] = nwh((df_temp['z_dif'] >= 1000), np.nan, 1)
    df_temp = df_temp.dropna()
    if len(df_temp) == 0:
        flg = 0
    else:
        flg = 1
    return flg, df_temp
and some_other_function looks like this:
def some_other_function(df_temp):
    df_temp['x_dif'] = df_temp['x_x'] * df_temp['x_y']
    df_temp['y_dif'] = df_temp['y_x'] * df_temp['y_y']
    df_temp['hor_dif'] = hypot(df_temp['x_dif'], df_temp['y_dif'])
    df_temp['conf'] = np.where((df_temp['hor_dif'] <= 5), 1, np.nan)
    flg_h = 0  # default so the function always returns a value
    if df_temp['conf'].sum() > 0:
        flg_h = 1
    return flg_h
The following are ways to make your code run faster (a short combined sketch follows the boolean-check example below):
Instead of a for loop, use a list comprehension.
Use built-in functions like map, filter, and sum; this will make your code faster.
Avoid repeated attribute ('.') lookups, for example:
import datetime
a = datetime.datetime.now()  # don't use this in a hot loop

from datetime import datetime
now = datetime.now
a = now()  # use this instead
Use C/C++-based libraries like NumPy.
Don't convert datatypes unnecessarily.
In infinite loops, use "while 1" instead of "while True" (a micro-optimization that matters mainly on older Python versions).
Use built-in libraries.
If the data will not change, convert it to a tuple.
Use efficient string concatenation (e.g., str.join for many pieces).
Use multiple assignment.
Use generators.
When checking a Boolean value in an if-else, avoid the explicit comparison:
# Instead of the below approach
if a == 1:
    print('a is 1')
else:
    print('a is 0')

# Try this approach
if a:
    print('a is 1')
else:
    print('a is 0')

# This helps because the time spent comparing the two values is saved.
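As a small illustration of the first few tips (a list comprehension plus hoisting the attribute lookup out of the loop), here is a hedged sketch on made-up data:

import math

values = range(1_000_000)

# plain loop: one attribute lookup (math.sqrt) per iteration
results = []
for v in values:
    results.append(math.sqrt(v))

# the same work with the lookup hoisted and a list comprehension
sqrt = math.sqrt
results = [sqrt(v) for v in values]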
Useful references:
Speeding up Python Code: Fast Filtering and Slow Loops
Speed Up Python Code

Pandas dataframe applymap parallel execution

I have the following functions to apply a bunch of regexes to each element in a dataframe. The dataframe that I am applying the regexes to is a 5MB chunk.
import re
from functools import partial

def apply_all_regexes(data, regexes):
    # apply the regex matcher to every cell of the pandas dataframe
    new_df = data.applymap(partial(apply_re_to_cell, regexes))
    return new_df

def apply_re_to_cell(regexes, cell):
    cell = str(cell)
    regex_matches = []
    for regex in regexes:
        regex_matches.extend(re.findall(regex, cell))
    return regex_matches
Due to the serial execution of applymap, the time taken to process is ~ elements * (serial execution of the regexes for 1 element). Is there any way to invoke parallelism? I tried ProcessPoolExecutor, but that appeared to take longer than executing serially.
Have you tried splitting your one big dataframe into (number of threads) smaller dataframes, applying the regex map in parallel, and sticking each small df back together?
I was able to do something similar with a dataframe about gene expression.
I would run it at small scale first and check that you get the expected output.
Unfortunately I don't have enough reputation to comment.
import numpy as np
import pandas as pd
from multiprocessing import Pool

def parallelize_dataframe(df, func):
    # num_partitions and num_cores are assumed to be defined globally
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    for x in df_split:
        print(x.shape)  # optional: inspect the chunk sizes
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
This is the general function I used
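For the regex question above, a usage sketch might look like the following (num_partitions, num_cores, and the example patterns are assumptions, not values from the original post):

from functools import partial

num_partitions = 8   # how many chunks to split the dataframe into
num_cores = 8        # how many worker processes to use

regexes = [r'\d+', r'[A-Z]\w+']   # example patterns only
func = partial(apply_all_regexes, regexes=regexes)

result_df = parallelize_dataframe(df, func)   # df is the 5MB chunk from the question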

Iterative union of multiple dataframes in PySpark

I am trying to concatenate multiple dataframes using the unionAll function in PySpark.
This is what I do:
df_list = []
for i in range(something):
    normalizer = Normalizer(inputCol="features", outputCol="norm", p=1)
    norm_df = normalizer.transform(some_df)
    norm_df = norm_df.repartition(320)
    data = index_df(norm_df)
    data.persist()
    mat = IndexedRowMatrix(
        data.select("id", "norm")\
            .rdd.map(lambda row: IndexedRow(row.id, row.norm.toArray()))).toBlockMatrix()
    dot = mat.multiply(mat.transpose())
    df = dot.toIndexedRowMatrix().rows.toDF()
    df_list.append(df)
big_df = reduce(unionAll, df_list)
big_df.write.mode('append').parquet('some_path')
I want to do that because the writing part takes time and therefore, it is much faster to write one big file than n small files in my case.
The problem is that when I write big_df and check the Spark UI, I have way too many tasks for writing parquet. While my goal is to write ONE big dataframe, it actually writes all the sub-dataframes.
Any guesses?
Spark is lazily evaluated.
The write operation is the action that triggers all previous transformations. Therefore those tasks are for those transformations, not just for writing the parquet files.
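A tiny sketch of what lazy evaluation means here (assuming an existing SparkSession named spark and a throwaway output path):

from pyspark.sql import functions as F

df = spark.range(10).withColumn('doubled', F.col('id') * 2)   # transformation: nothing runs yet
df = df.filter(F.col('doubled') > 4)                          # still nothing runs
df.write.mode('overwrite').parquet('/tmp/lazy_eval_demo')     # action: every transformation above executes now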

Scoping in for loop with if statement - pandas.append does not work in loop

This piece of code returns 10, which is what I would expect:
for i in range(5):
    if i == 0:
        output = i
    else:
        output += i
print(output)
Why does this code only return the dataframe created in the if branch of the statement (i.e. when i == 0)?
for i in range(5):
    if i == 0:
        output = pd.DataFrame(np.random.randn(5, 2))
    else:
        output.append(pd.DataFrame(np.random.randn(5, 2)))
print('final', output)
The above is an MCVE of the issue I am having with the code below.
More context if interested:
for index, row in per_dmd_df.iterrows():
    if index == 0:
        output = pd.DataFrame(dmd_flow(row.balance, dt.date(2018,1,31), 12, .05, 0, .03, 'monthly'))
    else:
        output.append(pd.DataFrame(dmd_flow(row.balance, dt.date(2018,1,31), 12, .05, 0, .03, 'monthly')))
print(output)
I have an input DataFrame with one row per product, with balances, rates, etc. I want to use the data in each DF row to call the dmd_flow function (it returns a generator that, when passed to pd.DataFrame(), yields a 12-month forward-looking balance forecast) to forecast changes in the balance of each product based on the parameters passed to dmd_flow. I would then add up all of the changes to get the net change in balance (done using a group by on the date and summing balances).
Each call to this creates the new DataFrame I need:
pd.DataFrame(dmd_flow(row.balance, dt.date(2018,1,31),12,.05,0,.03,'monthly'))
but the append doesn't work to expand the output DataFrame.
Because, unlike list.append, DataFrame.append is not an in-place operation. See the docs for more information. You're supposed to assign the result back:
df = df.append(...)
Although, in this case, I'd advise using something like apply if you are unable to vectorize your function:
df['balance'].apply(
    dmd_flow, args=(dt.date(2018, 1, 31), 12, .05, 0, .03, 'monthly')
)
Which hides the loop, so you don't need to worry about the index. Make sure your function is written in such a way so as to support scalar arguments.
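Note also that DataFrame.append has since been deprecated and removed in newer pandas releases, so the usual idiom now is to collect the pieces in a list and concatenate once at the end. A minimal sketch of that pattern for the MCVE above:

import numpy as np
import pandas as pd

pieces = []
for i in range(5):
    pieces.append(pd.DataFrame(np.random.randn(5, 2)))

# a single concat at the end is also cheaper than repeatedly growing a frame
output = pd.concat(pieces, ignore_index=True)
print('final', output)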

dask.DataFrame.apply and variable length data

I would like to apply a function that returns a Series of variable length to a dask.DataFrame. An example to illustrate this:
import dask.dataframe as dd
import numpy as np
import pandas as pd

def generate_varibale_length_series(x):
    '''returns pd.Series with variable length'''
    n_columns = np.random.randint(100)
    return pd.Series(np.random.randn(n_columns))

# apply this function to a dask.DataFrame
pdf = pd.DataFrame(dict(A=[1, 2, 3, 4, 5, 6]))
ddf = dd.from_pandas(pdf, npartitions=3)
result = ddf.apply(generate_varibale_length_series, axis=1).compute()
Apparently, this works fine.
Concerning this, I have two questions:
Is this supposed to always work, or am I just lucky here? Does dask expect all partitions to have the same number of columns?
In case the metadata inference fails, how can I provide metadata if the number of columns is not known beforehand?
Background / usecase: In my dataframe each row represents a simulation trail. The function I want to apply extracts time points of certain events from it. Since I do not know the number of events per trail in advance, I do not know how many columns the resulting dataframe will have.
Edit:
As MRocklin suggested, here is an approach that uses dask.delayed to compute the result:
import dask

# convert ddf to delayed objects
ddf_delayed = ddf.to_delayed()

# delayed version of pd.DataFrame.apply
delayed_apply = dask.delayed(lambda x: x.apply(generate_varibale_length_series, axis=1))

# use this function on every delayed object (one per partition)
apply_on_every_partition_delayed = [delayed_apply(d) for d in ddf_delayed]

# calculate the result. This gives a list of pd.DataFrame objects
result = dask.compute(*apply_on_every_partition_delayed)

# concatenate them
result = pd.concat(result)
Short answer
No, dask.dataframe does not support this
Long answer
Dask.dataframe expects to know the columns of every partition ahead of time and it expects those columns to match.
However, you can still use Dask and Pandas together through dask.delayed, which is far more capable of handling problems like these.
http://dask.pydata.org/en/latest/delayed.html
