How to multiprocess operations on dataframe - python

I have to perform lots of operations on a dataframe, and it takes a long time using a single core, so I am trying to implement multiprocessing.
Right now, while I am still figuring out how it works, I am using a simpler version where I just want to add values from data:
import multiprocessing
import pandas as pd

def add_values(a):
    df = pd.DataFrame([{'n': a}])
    return df

df = pd.DataFrame([{'n': 0}])
data = [9, 4, 5]

with multiprocessing.Pool(processes=4) as pool:
    df = df.add(pool.map(add_values, data))

df
I would like df to end up as a dataframe with n=18, but instead I get this error message: ValueError: Unable to coerce to Series, length must be 1: given 3

The issue here is how you treat the return value from your multiprocessing call. pool.map() returns a list. In this particular case it will be a list of dataframes, i.e. what your call expands to is equivalent to df = df.add([dfn9, dfn4, dfn5]), where the dfnXs are different dataframes.
This input is neither expected nor handled by df.add(), which expects something that can be turned into a pd.Series object and added to the original frame. Instead, you need to take this list and "manually" reduce it, e.g. as:
import multiprocessing
import pandas as pd

def add_values(a):
    df = pd.DataFrame([{'n': a}])
    return df

df = pd.DataFrame([{'n': 0}])
data = [9, 4, 5]

with multiprocessing.Pool(processes=4) as pool:
    # df = df.add(pool.map(add_values, data))  # does not work
    dfs = pool.map(add_values, data)
    print(type(dfs))

    # Reduce the returned dataframes one by one
    for d in dfs:
        df = df.add(d)

print(df)
The reduction must happen in a single process, as the different processes do not share the same df (instead they all have identical copies).
As a side note, I think you should also consider using multithreading rather than multiprocessing. It may be simpler, as threads can share the same memory, which removes the need to copy data between processes. Also, since pandas releases the GIL for many operations, you do not have the problem of only one thread being able to execute at a time.
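For illustration, here is a minimal sketch of the threaded variant with concurrent.futures, reusing the same toy add_values example as above; whether threads pay off in your real workload depends on how much of it actually releases the GIL.

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def add_values(a):
    return pd.DataFrame([{'n': a}])

df = pd.DataFrame([{'n': 0}])
data = [9, 4, 5]

# Threads share memory, so the frames returned by the workers can be
# reduced directly in the main thread, with no inter-process copies.
with ThreadPoolExecutor(max_workers=4) as executor:
    for d in executor.map(add_values, data):
        df = df.add(d)

print(df)  # n should be 18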

Related

How to save outputs (from xarray) from python dask delayed into a pandas dataframe

I am very new to trying to parallelize my python code. I am trying to perform some analysis on an xarray, then fill in a pandas dataframe with the results. The columns of the dataframe are independent, so I think it should be trivial to parallelise using dask delayed, but I can't work out how. My xarrays are quite big, so this loop takes a while and uses a lot of memory. It could also be chunked by time instead, if that's easier (this might help with memory)!
Here is the un-parallelized version:
from time import sleep

import numpy as np
import pandas as pd
import xarray as xr
import dask.dataframe as dd

data1 = np.random.rand(4, 3, 3)
data2 = np.random.randint(4, size=(3, 3))
locs1 = ["IA", "IL", "IN"]
locs2 = ['a', 'b', 'c']
times = pd.date_range("2000-01-01", periods=4)
xarray1 = xr.DataArray(data1, coords=[times, locs1, locs2], dims=["time", "space1", "space2"])
xarray2 = xr.DataArray(data2, coords=[locs1, locs2], dims=["space1", "space2"])

def delayed_where(xarray1, xarray2, id):
    sleep(1)
    return xarray1.where(xarray2 == id).mean(axis=(1, 2)).to_dataframe(id)

final_df = pd.DataFrame(columns=range(4), index=times)
for column in final_df:
    final_df[column] = delayed_where(xarray1, xarray2, column)
I would like to parallelize the for loop, and have tried:
final_df_delayed = pd.DataFrame(columns=range(4), index=times)
for column in final_df:
    final_df_delayed[column] = delayed(delayed_where)(xarray1, xarray2, column)
final_df.compute()
Or maybe something with dask delayed?
final_df_dd = dd.from_pandas(final_df, npartitions=2)
for column in final_df:
    final_df_dd[column] = delayed(delayed_where)(xarray1, xarray2, column)
final_df_dd.compute()
But none of these work. Can anyone help?
You're using delayed correctly, but it's not possible to construct a dask dataframe in the way you specified.
from dask import delayed
import dask

@delayed
def delayed_where(xarray1, xarray2, id):
    sleep(1)
    return xarray1.where(xarray2 == id).mean(axis=(1, 2)).to_dataframe(id)

@delayed
def form_df(list_col_results):
    final_df = pd.DataFrame(columns=range(4), index=times)
    for n, column in enumerate(final_df):
        final_df[column] = list_col_results[n]
    return final_df

delayed_cols = [delayed_where(xarray1, xarray2, col) for col in final_df.columns]
delayed_df = form_df(delayed_cols)
delayed_df.compute()
Note that the enumeration is a clumsy way to get the correct order of the columns, but your actual problem might guide you to a better way of specifying this (e.g. by explicitly specifying each column as an individual argument); one such alternative is sketched below.
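For example, here is a hedged sketch of passing each column id alongside its delayed result, so the assembly step no longer relies on positional order. It reuses final_df, times, xarray1/xarray2 and delayed_where from above; form_df_from_pairs is just an illustrative name, and this is untested against your data:

@delayed
def form_df_from_pairs(pairs):
    # pairs is a list of (column id, per-column result) tuples
    out = pd.DataFrame(index=times)
    for col_id, col_result in pairs:
        out[col_id] = col_result
    return out

pairs = [(col, delayed_where(xarray1, xarray2, col)) for col in final_df.columns]
delayed_df = form_df_from_pairs(pairs)
delayed_df.compute()

dask.delayed traverses lists and tuples in the arguments, so the nested Delayed objects inside pairs are still resolved before form_df_from_pairs runs.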

Infinite Pool process while the work is achieved, in python with a function applied to a sequence of pandas.DataFrame

I split a pandas.DataFrame into 6 parts with a similar number of rows (118317, 118315, ...) in order to balance the work and to preserve the integrity of the data when using groupby on a field.
These 6 parts are pandas.DataFrames stored in a list.
The function to apply to each of them, in parallel, is the following:
def compute_recency(df):
    recency = df.groupby('name').apply(lambda x: x['date'] - x['date'].shift()).fillna(0).reset_index()
    df = df.join(recency.set_index('level_1'), rsuffix='_f')
    return df
Then I parallelized the processing as:
import multiprocessing as mp
cores=mp.cpu_count()
pool = mp.Pool(cores)
df_out = pool.map(compute_recency, list_of_6_dataframes)
pool.close()
pool.join()
The issue is that the notebook in Jupyter Lab keeps showing the cell as running (=> [*]), while I can see in my resource monitor that the CPUs are now "free", i.e. they are no longer at 100% as they were at first.
Note that if I use the following function:
def func(df):
    return df.shape
it works well and quickly, with no [*] hanging forever.
So I guess the issue comes from the compute_recency function, but I don't see why.
Can you help me?
Pandas version: 0.23.4 Python version: 3.7.4
It's a little difficult to see what might be causing the issue here. Since you are using multiprocessing anyway, perhaps break your data up into the groups created by groupby and then process each group in parallel?
import pandas as pd
from multiprocessing import Pool

groups = [x for _, x in df.groupby("name")]

def add_new_col(x):
    x['new'] = x['date'] - x['date'].shift().fillna(0)
    return x

p = Pool()
groups = p.map(add_new_col, groups)
df = pd.concat(groups, ignore_index=True)
p.close()
p.join()
By the way, regarding your original code: p.map returns a list of dataframes, not a dataframe, which is why I've used pd.concat to combine the results at the end.
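For completeness, applied to your original code that would look roughly like this; it is only a sketch that assumes compute_recency and list_of_6_dataframes from your post are in scope:

import multiprocessing as mp

import pandas as pd

if __name__ == '__main__':
    pool = mp.Pool(mp.cpu_count())
    # pool.map hands back a list of dataframes; glue them into one frame here
    df_out = pd.concat(pool.map(compute_recency, list_of_6_dataframes),
                       ignore_index=True)
    pool.close()
    pool.join()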

Pandas dataframe applymap parallel execution

I have the following functions to apply a bunch of regexes to each element in a dataframe. The dataframe that I am applying the regexes to is a 5MB chunk.
import re
from functools import partial

def apply_all_regexes(data, regexes):
    # find all regex matches in every cell of the pandas dataframe
    new_df = data.applymap(partial(apply_re_to_cell, regexes))
    return new_df

def apply_re_to_cell(regexes, cell):
    cell = str(cell)
    regex_matches = []
    for regex in regexes:
        regex_matches.extend(re.findall(regex, cell))
    return regex_matches
Due to the serial execution of applymap, the time taken to process is ~ elements * (serial execution of the regexes for 1 element). Is there any way to invoke parallelism? I tried ProcessPoolExecutor, but that appeared to take longer than executing serially.
Have you tried splitting your one big dataframe into as many small dataframes as you have workers, applying the regex map to them in parallel, and sticking the small dfs back together?
I was able to do something similar with a dataframe about gene expression.
I would run it at small scale first and check that you get the expected output.
Unfortunately I don't have enough reputation to comment.
import numpy as np
import pandas as pd
from multiprocessing import Pool

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    for x in df_split:
        print(x.shape)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
This is the general function I used; a usage example is sketched below.
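For instance, wired up to the question's apply_all_regexes it might look like this; num_partitions, num_cores, and the toy data and patterns are illustrative assumptions, not part of the original post:

import re
from functools import partial
from multiprocessing import Pool

import numpy as np
import pandas as pd

num_partitions = 8   # how many chunks to split the frame into
num_cores = 4        # worker processes

if __name__ == '__main__':
    regexes = [r'\d+', r'[A-Z]\w+']   # placeholder patterns
    df = pd.DataFrame({'text': ['abc 123', 'Foo 42', 'nothing here']})
    # bind the regex list so each worker only receives a chunk of the frame
    result = parallelize_dataframe(df, partial(apply_all_regexes, regexes=regexes))
    print(result)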

dask.DataFrame.apply and variable length data

I would like to apply a function that returns a Series of variable length to a dask.DataFrame. An example to illustrate this:
import numpy as np
import pandas as pd
import dask.dataframe as dd

def generate_varibale_length_series(x):
    '''returns pd.Series with variable length'''
    n_columns = np.random.randint(100)
    return pd.Series(np.random.randn(n_columns))

# apply this function to a dask.DataFrame
pdf = pd.DataFrame(dict(A=[1, 2, 3, 4, 5, 6]))
ddf = dd.from_pandas(pdf, npartitions=3)
result = ddf.apply(generate_varibale_length_series, axis=1).compute()
Apparently, this works fine.
Concerning this, I have two questions:
Is this supposed to always work, or am I just lucky here? Does dask expect all partitions to have the same number of columns?
In case the metadata inference fails, how can I provide metadata if the number of columns is not known beforehand?
Background / use case: In my dataframe each row represents a simulation trial. The function I want to apply extracts time points of certain events from it. Since I do not know the number of events per trial in advance, I do not know how many columns the resulting dataframe will have.
Edit:
As MRocklin suggested, here is an approach that uses dask.delayed to compute the result:
import dask

# convert ddf to delayed objects
ddf_delayed = ddf.to_delayed()

# delayed version of pd.DataFrame.apply
delayed_apply = dask.delayed(lambda x: x.apply(generate_varibale_length_series, axis=1))

# use this function on every delayed object
apply_on_every_partition_delayed = [delayed_apply(d) for d in ddf.to_delayed()]

# calculate the result; this gives a list of pd.DataFrame objects
result = dask.compute(*apply_on_every_partition_delayed)

# concatenate them
result = pd.concat(result)
Short answer
No, dask.dataframe does not support this.
Long answer
Dask.dataframe expects to know the columns of every partition ahead of time and it expects those columns to match.
However, you can still use Dask and Pandas together through dask.delayed, which is far more capable of handling problems like these.
http://dask.pydata.org/en/latest/delayed.html
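As a small illustration (not from the original post) of why the delayed-plus-pandas route copes with variable widths: pd.concat simply takes the union of the column labels and pads missing cells with NaN.

import numpy as np
import pandas as pd

a = pd.DataFrame(np.random.randn(2, 3))   # a partition whose rows produced 3 values
b = pd.DataFrame(np.random.randn(2, 5))   # a partition whose rows produced 5 values

# Columns 3 and 4 of the first frame are filled with NaN after concatenation.
print(pd.concat([a, b]))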

Should pandas dataframes be nested?

I am creating a python script that drives an old fortran code to locate earthquakes. I want to vary the input parameters to the fortran code in the python script and record the results, as well as the values that produced them, in a dataframe. The results from each run are also convenient to put in a dataframe, leading me to a situation where I have a nested dataframe (i.e. a dataframe assigned to an element of a dataframe). So for example:
import pandas as pd
import numpy as np

def some_operation(row):
    results = np.random.rand(50, 3) * row['p1'] / row['p2']
    res = pd.DataFrame(results, columns=['foo', 'bar', 'rms'])
    return res

# Init master df
df_master = pd.DataFrame(columns=['p1', 'p2', 'results'], index=range(3))
df_master['p1'] = np.random.rand(len(df_master))
df_master['p2'] = np.random.rand(len(df_master))
df_master = df_master.astype(object)  # make sure generic types can be used

# loop over each row, call some_operation and store the results DataFrame
for ind, row in df_master.iterrows():
    df_master.loc[ind, "results"] = some_operation(row)
Which raises this exception:
ValueError: Incompatible indexer with DataFrame
It works as expected, however, if I change the last line to this:
df_master["results"][ind] = some_operation(row)
I have a few questions:
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc., it seems to work fine.
Should the DataFrame be used in this way? I know that dtype object can be ultra slow for sorting and whatnot, but I am really just using the dataframe as a convenient container, because the column/index notation is quite slick. If DataFrames should not be used in this way, is there a similar alternative? I was looking at the Panel class but I am not sure if it is the proper solution for my application. I would hate to forge ahead and apply the hack shown above to some code and then have it not supported in future releases of pandas.
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc. it seems to work fine.
This is a strange little corner case of the code. It stems from the fact that if the item being assigned is a DataFrame, loc and ix assume that you want to fill the given indices with the content of the DataFrame. For example:
>>> df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df2 = pd.DataFrame({'a': [100], 'b': [200]})
>>> df1.loc[[0], ['a', 'b']] = df2
>>> df1
     a    b
0  100  200
1    2    5
2    3    6
If this syntax also allowed storing a DataFrame as an object, it's not hard to imagine a situation where the user's intent would be ambiguous, and ambiguity does not make a good API.
Should the DataFrame be used in this way?
As long as you know the performance drawbacks of the method (and it sounds like you do) I think this is a perfectly suitable way to use a DataFrame. For example, I've seen a similar strategy used to store the trained scikit-learn estimators in cross-validation across a large grid of parameters (though I can't recall the exact context of this at the moment...)
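If you want to sidestep the indexer quirk entirely, one hedged alternative is to keep the parameter table flat and store each run's result frame in a plain dict keyed by the master index; this sketch just reuses df_master and some_operation from the question:

# Store each result DataFrame under the index of the row that produced it.
results = {}
for ind, row in df_master.iterrows():
    results[ind] = some_operation(row)

# The parameters that produced a given result stay available via the shared index.
print(df_master.loc[0, ['p1', 'p2']])
print(results[0].head())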
