I have the following functions to apply a bunch of regexes to each element in a dataframe. The dataframe that I am applying the regexes to is a 5 MB chunk.
import re
from functools import partial

def apply_all_regexes(data, regexes):
    # apply the regex search to every cell of the pandas dataframe
    new_df = data.applymap(partial(apply_re_to_cell, regexes))
    return new_df

def apply_re_to_cell(regexes, cell):
    cell = str(cell)
    regex_matches = []
    for regex in regexes:
        regex_matches.extend(re.findall(regex, cell))
    return regex_matches
Due to the serial execution of applymap, the time taken to process is roughly elements * (time for the serial execution of the regexes on 1 element). Is there any way to invoke parallelism? I tried ProcessPoolExecutor, but that appeared to take longer than executing serially.
Have you tried splitting your one big dataframe into a number of smaller dataframes (one per worker), applying the regex map to each in parallel, and sticking the small dfs back together?
I was able to do something similar with a dataframe about gene expression.
I would run it at a small scale first and check that you get the expected output.
Unfortunately I don't have enough reputation to comment
import numpy as np
import pandas as pd
from multiprocessing import Pool

def parallelize_dataframe(df, func, num_partitions=8, num_cores=8):
    # split the dataframe into num_partitions chunks
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    for x in df_split:
        print(x.shape)  # sanity check on the chunk sizes
    # apply func to every chunk in parallel and glue the results back together
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
This is the general function I used
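For the regex use case in the question, usage might look like the following sketch; the regex patterns and the partition/core counts are placeholder assumptions you would tune yourself:
from functools import partial

regexes = [r'\d+', r'[A-Z]{2,}']  # placeholder patterns, not from the question
# bind the regex list so each worker only receives a dataframe chunk
func = partial(apply_all_regexes, regexes=regexes)
result_df = parallelize_dataframe(df, func, num_partitions=8, num_cores=8)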
Related
I split a pandas.DataFrame into 6 parts with a similar number of rows (118317, 118315, ...) in order to balance the work and to respect the integrity of the data, since I use groupby on a field.
These 6 parts are pandas.DataFrames stored in a list.
The function to apply to each one, in parallel, is the following.
def compute_recency(df):
    recency = df.groupby('name').apply(lambda x: x['date'] - x['date'].shift()).fillna(0).reset_index()
    df = df.join(recency.set_index('level_1'), rsuffix='_f')
    return df
Then I parallelized the processes as:
import multiprocessing as mp
cores=mp.cpu_count()
pool = mp.Pool(cores)
df_out = pool.map(compute_recency, list_of_6_dataframes)
pool.close()
pool.join()
The issue is that it keeps calculating in the JupyterLab notebook (=> [*]), while I can see in my resource monitor that the CPUs are now "free", I mean they are not at 100% as they were at first.
Note if I use the following function:
def func(df):
    return df.shape
It works well and quickly, with no [*] hanging forever.
So I guess the issue comes from the function compute_recency, but I don't see why.
Can you help me?
Pandas version: 0.23.4 Python version: 3.7.4
It's a little difficult to see what might be causing the issue here. Since you are already using multiprocessing, perhaps break your data up into the groups created by groupby and then process each group using multiprocessing?
from multiprocessing import Pool
groups = [x for _, x in df.groupby("name")]
def add_new_col(x):
    x['new'] = (x['date'] - x['date'].shift()).fillna(0)
    return x
p = Pool()
groups = p.map(add_new_col, groups)
df = pd.concat(groups, ignore_index=True)
p.close()
p.join()
By the way, in regards to your original code: p.map will return a list of dataframes, not a dataframe, which is why I've used pd.concat to combine the results at the end.
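Applied to the original code, that means concatenating the mapped results yourself; a minimal sketch, reusing the pool, compute_recency and list_of_6_dataframes from the question:
# pool.map returns a list of 6 dataframes, so glue them back together
df_out = pd.concat(pool.map(compute_recency, list_of_6_dataframes))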
I have to perform lots of operations on a dataframe and it takes a long time using a single core. I am trying to implement multiprocessing.
Right now, while I am trying to figure out how it works, I am using a simpler version where I just want to add values from data:
import multiprocessing
import pandas as pd

def add_values(a):
    df = pd.DataFrame([{'n': a}])
    return df

df = pd.DataFrame([{'n': 0}])
data = [9, 4, 5]
with multiprocessing.Pool(processes=4) as pool:
    df = df.add(pool.map(add_values, data))
df
I would like df to end up as a dataframe with n=18, but I get this error message: ValueError: Unable to coerce to Series, length must be 1: given 3
The issue here is how you treat the return value from your multiprocessing calls. pool.map() returns a list. In this particular case, it will be a list of dataframes, i.e., what your call expands to is equivalent to df = df.add([dfn9, dfn4, dfn5]), where the dfnXs are different dataframes.
This input is neither expected nor handled by df.add(), which expects something that can be turned into a pd.Series object and added to the original frame. Instead, you need to take this list and "manually" reduce it, e.g. as:
import multiprocessing
import pandas as pd

def add_values(a):
    df = pd.DataFrame([{'n': a}])
    return df

df = pd.DataFrame([{'n': 0}])
data = [9, 4, 5]
with multiprocessing.Pool(processes=4) as pool:
    # df = df.add(pool.map(add_values, data))  # does not work
    dfs = pool.map(add_values, data)
    print(type(dfs))

# Reducing the return values
for d in dfs:
    df = df.add(d)
print(df)
The reduction must happen in a single process, as the different processes do not share the same df (instead they all have identical copies).
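If you prefer, that explicit loop can be written as a single reduction; this is just a sketch of the same single-process step, not a different approach:
from functools import reduce

# fold the list of per-process dataframes into the parent's df one by one
df = reduce(lambda acc, d: acc.add(d), dfs, df)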
As a side note, I think you should also consider using multithreading rather than multiprocessing. It may be simpler, as threads can share the same memory and reduce the need for copying data. Also, since pandas releases the GIL for many operations, there is less of a problem with only one thread being able to execute at a time.
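A minimal sketch of that threading alternative, assuming the add_values function and data list defined above, could look like this (using concurrent.futures rather than anything pandas-specific):
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

df = pd.DataFrame([{'n': 0}])
with ThreadPoolExecutor(max_workers=4) as executor:
    # map yields one dataframe per item; reduce them in the main thread
    for d in executor.map(add_values, data):
        df = df.add(d)
print(df)  # n should be 18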
Problem statement: How do I parallelize a for loop that splits a pandas dataframe into two parts, applies a function to each part (also in parallel), and stores the combined results from the function in a list to use after the loop is over?
For context, I am trying to parallelize my decision tree implementation. Many of the previous answers I have seen related to this question require the result of the applied function to be a dataframe, with the results simply concatenated into one big dataframe. I believe this question is slightly more general.
For example, this is the code I would like to parallelize:
# suppose we have some dataframe given to us
df = pd.DataFrame(....)

computation_results = []

# I would like to parallelize this whole loop and store the results of the
# computations in computation_results. min_rows and total_rows are known
# integers.
for i in range(min_rows, total_rows - min_rows + 1):
    df_left = df.loc[range(0, i), :].copy()
    df_right = df.loc[range(i, total_rows), :].copy()

    # foo is a function that takes in a dataframe and returns some
    # result that has no pointers to the passed dataframe. The following
    # two function calls should also be parallelized.
    left_results = foo(df_left)
    right_results = foo(df_right)

    # combine the results with some function and append that combination
    # to a list. The order of the results in the list does not matter.
    computation_results.append(combine_results(left_results, right_results))

# parallelization is not needed for the following function and the loop is over
use_computation_results(computation_results)
Check the example in https://docs.python.org/3.3/library/multiprocessing.html#using-a-pool-of-workers.
So in your case:
from multiprocessing import Pool

with Pool(processes=2) as pool:  # start 2 worker processes
    for i in range(min_rows, total_rows - min_rows + 1):
        df_left = df.loc[range(0, i), :].copy()
        call_left = pool.apply_async(foo, (df_left,))    # evaluate "foo(df_left)" asynchronously
        df_right = df.loc[range(i, total_rows), :].copy()
        call_right = pool.apply_async(foo, (df_right,))  # evaluate "foo(df_right)" asynchronously
        left_results = call_left.get(timeout=1)          # wait for and get the left result
        right_results = call_right.get(timeout=1)        # wait for and get the right result
        computation_results.append(combine_results(left_results, right_results))
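If waiting on the two .get() calls inside each iteration serializes things more than you'd like, one hedged variation (my sketch, not part of the answer above) is to submit every foo(...) call first and collect the results afterwards, so all worker processes stay busy:
from multiprocessing import Pool

computation_results = []
with Pool() as pool:
    # queue up both halves of every split before waiting on any result
    async_calls = []
    for i in range(min_rows, total_rows - min_rows + 1):
        df_left = df.loc[range(0, i), :].copy()
        df_right = df.loc[range(i, total_rows), :].copy()
        async_calls.append((pool.apply_async(foo, (df_left,)),
                            pool.apply_async(foo, (df_right,))))
    # now block on the results in order and combine them
    for call_left, call_right in async_calls:
        computation_results.append(combine_results(call_left.get(), call_right.get()))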
I've implemented a fuzzy string matching algorithm between two dataframes using just pandas. My issue is: how do I convert this to a dask operation that uses multiple cores? My program runs for about 3-4 days on pure Python, and I want to parallelize the operations to reduce the time cost. I've already used the multiprocessing package to extract the number of cores using the code below:
numCores = multiprocessing.cpu_count()
fields = ['id','phase','new']
emb = pd.read_csv('my_csv.csv', skipinitialspace=True, usecols=fields)
Then I had to subdivide the dataframe emb into two dataframes (emb1, emb2) based on the numeric value associated with each string. That is, I'm matching the rows with value 3 in one dataframe to their corresponding value-2 rows in the other dataframe by matched string. The code for the pure pandas operation is below.
emb1 = emb[emb.phase.isin([3.0])]
emb1.set_index('id',inplace=True)
emb2 = emb[emb.phase.isin([2.0,1.5])]
emb2.set_index('id',inplace=True)
def fuzzy_match(x, choices, scorer, cutoff):
    return process.extractOne(x, choices=choices, scorer=scorer, score_cutoff=cutoff)
FuzzyWuzzyResults = pd.DataFrame(emb1.sort_index().loc[:,'strings'].apply(fuzzy_match, args = (emb2.loc[:,'strings'],fuzz.ratio,90)))
I sort of tried doing a dask implementation using this code:
emb1 = dd.from_pandas(emb1, npartitions=numCores)
emb2 = dd.from_pandas(emb2, npartitions=numCores)
But running the lambda function for two dataframes is confusing me. Any ideas?
So I just fixed my code to remove the manual partition of the dataframe and used groupby instead.
Here's the code:
for i in [2.0, 1.5]:
    FuzzyWuzzyResults = emb.map_partitions(lambda df: df.groupby('phase').get_group(3.0)['drugs'].apply(fuzzy_match, args=(df.groupby('phase').get_group(i)['drugs'], fuzz.ratio, 90)), meta=('results')).compute()
Not sure whether it's accurate, but at least it's running, and on all CPU cores at that.
I would like to apply a function that returns a Series of variable length to a dask.DataFrame. An example to illustrate this:
import numpy as np
import pandas as pd
import dask.dataframe as dd

def generate_varibale_length_series(x):
    '''returns a pd.Series with variable length'''
    n_columns = np.random.randint(100)
    return pd.Series(np.random.randn(n_columns))

# apply this function to a dask.DataFrame
pdf = pd.DataFrame(dict(A=[1, 2, 3, 4, 5, 6]))
ddf = dd.from_pandas(pdf, npartitions=3)
result = ddf.apply(generate_varibale_length_series, axis=1).compute()
Apparently, this works fine.
Concerning this, I have two questions:
Is this always supposed to work, or am I just lucky here? Does dask expect all partitions to have the same number of columns?
In case the metadata inference fails, how can I provide metadata if the number of columns is not known beforehand?
Background / use case: In my dataframe each row represents a simulation trial. The function I want to apply extracts the time points of certain events from it. Since I do not know the number of events per trial in advance, I do not know how many columns the resulting dataframe will have.
Edit:
As MRocklin suggested, here is an approach that uses dask.delayed to compute the result:
import dask

# convert ddf to delayed objects
ddf_delayed = ddf.to_delayed()

# delayed version of pd.DataFrame.apply
delayed_apply = dask.delayed(lambda x: x.apply(generate_varibale_length_series, axis=1))

# use this function on every delayed object
apply_on_every_partition_delayed = [delayed_apply(d) for d in ddf_delayed]

# calculate the result. This gives a tuple of pd.DataFrame objects
result = dask.compute(*apply_on_every_partition_delayed)

# concatenate them
result = pd.concat(result)
Short answer
No, dask.dataframe does not support this
Long answer
Dask.dataframe expects to know the columns of every partition ahead of time and it expects those columns to match.
However, you can still use Dask and Pandas together through dask.delayed, which is far more capable of handling problems like these.
http://dask.pydata.org/en/latest/delayed.html