I have a dataframe of 350k rows and one column (named 'text').
I want to apply this function to my dataset:
def extract_keyphrases(caption, n):
extractor = pke.unsupervised.TopicRank()
extractor.load_document(caption)
extractor.candidate_selection(pos=pos, stoplist=stoplist)
extractor.candidate_weighting(threshold=0.74, method='average')
keyphrases = extractor.get_n_best(n=n, stemming=False)
return(keyphrases)
df['keywords'] = df.apply(lambda row: (extract_keyphrases(row['text'],10)),axis=1)
But if I run it, it takes a lot of time to complete (nearly 50 hours).
It is possible to use chunksize or other methods to parallelize dataframe operations and how?
Thank you for your time!
Use multiprocessing module. To avoid an overhead by creating one process per row, each process handles 20,000 rows:
import multiprocessing
def extract_keyphrases(caption, n):
...
def extract_keyphrases_batch(captions):
for caption in captions:
extract_keyphrases(caption, 10)
def get_chunks(df, size):
for i in range(0, len(df), size):
yield df.iloc[i:min(i+size, len(df))]
if __name__ == '__main__':
with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
data = pool.map(extract_keyphrases_batch, get_chunks(df, 20000))
out = pd.concat(data)
Related
I have a dataframe and want to do add a column by taking first 3 digits of a base column using multiprocessing.
Please see below the python code:
import multiprocessing as mp
import pandas as pd
import numpy as np
data = pd.DataFrame({'employee':['Donald','Douglas','Jennifer','Michael','Pat','Susan','Hermann','Shelley','William',
'Steven','Neena','Lex','Alexander','Bruce','David','Valli','Diana','Nancy','Daniel','John'],
'PHONE_NUMBER':['650.507.9833','650.507.9844','515.123.4444','515.123.5555','603.123.6666',
'515.123.7777','515.123.8888','515.123.8080','515.123.8181','515.123.4567','515.123.4568',
'515.123.4569','590.423.4567','590.423.4568','590.423.4569','590.423.4560','590.423.5567',
'515.124.4569','515.124.4169','515.124.4269']})
# Part3- Multiprocessing thread
def strip_digits(x):
return str(x)[:3]
def city_code(x):
x['start_digits'] = x['PHONE_NUMBER'].apply(strip_digits)
return x
def parallelize(df,func):
df_split = np.array_split(df,partitions)
pool = mp.Pool(cores)
df_retun = pd.concat(pool.map(func,df_split), ignore_index=True)
pool.close
return df_retun
if __name__=='__main__':
mp.set_start_method('spawn')
cores = mp.cpu_count()
partitions = cores
df = parallelize(data, city_code)
group_data = df.groupby(['start_digits'])
group_size = group_data.size()
print(group_data.get_group('515'))
I am getting different Attributes error. please help me to identify the error in the code. This is a sample dataframe. I want to do similar task for large dataframe using multi-processing.
Thanks in advance.
I have stumbled across similar questions but all of them used either .csv or .txt files and a solution was to read in the data line by line or in chunks. I am not aware of that being possible with parquet files as they were designed to be read by columns and not rows.
My first attempt below works well with a smaller subset/test dataset created from the original full dataset.
def process_single_group(group_df):
# Super simple version of my function
group_df['E'] = group_df['C'] + group_df['D']
group_df['F'] = group_df['C'] - group_df['D']
group_df['G'] = group_df['C'] * group_df['D']
return group_df
def group_of_groups_loop(group_of_group_df):
df_agg = pd.DataFrame()
for i, group_df in enumerate(group_of_group_df):
t = process_single_group(group_df)
df_agg = pd.concat([df_agg, t])
return df_agg
num_processes = os.cpu_count()
pool = Pool(processes=num_processes)
df = pd.read_parquet('path/to/dataset.parquet')
# With small dataset, there are about 300 groups
# With full dataset, there are about ~800k groups
groupby = df.groupby(by=['A', 'B'])
# Split the groups into chunks of groups
groupby_split = np.array_split(groupby, num_processes)
# Create a list where each element is a chunk of groups
# The starmap expects this sort of input
input_list = [[gb] for gb in groupby_split]
x = pd.concat(pool.starmap(group_of_groups_loop, input_list))
pool.join()
pool.close()
x.to_parquet('path/to/save/file.parquet')
but when I switch to the full parquet file, I get the error:
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
which I expected.
My solution to this was to break the very large number of groups into smaller chunks of groups (size similar to the subset) and loop over each one like with the subset earlier.
EDIT I forgot to add the second np.array_split within the loop.
def process_single_group(group_df):
# Super simple version of my function
group_df['E'] = group_df['C'] + group_df['D']
group_df['F'] = group_df['C'] - group_df['D']
group_df['G'] = group_df['C'] * group_df['D']
return group_df
def group_of_groups_loop(group_of_group_df):
df_agg = pd.DataFrame()
for i, group_df in enumerate(group_of_group_df):
t = process_single_group(group_df)
df_agg = pd.concat([df_agg, t])
return df_agg
num_processes = os.cpu_count()
pool = Pool(processes=num_processes)
df = pd.read_parquet('path/to/dataset.parquet')
# With small dataset, there are about 300 groups
# With full dataset, there are about ~800k groups
groupby = df.groupby(by=['A', 'B'])
# Split the large number of groups into smaller chunks
N = 10
groupby_split = np.array_split(groupby, N)
final_agg_df = pd.DataFrame()
# iterate over each of the smaller chunks
for groupby_group in groupby_split:
groupby_split_split = np.array_split(groupby_group, num_processes)
# Create a list to use as argument for starmap
input_list = [[gb] for gb in groupby_split_split]
x = pd.concat(pool.starmap(group_of_groups_loop, input_list))
pool.join()
pool.close()
final_agg_df = pd.concat([final_agg_df, x])
final_agg_df.to_parquet('path/to/save/file.parquet')
But this is still giving me the same error...
I thought that since the pool was created prior to reading in the large parquet file (I read a solution earlier that mentioned doing this) that each process would only be given the small chunk of groups.
I am wondering if there is something I missed? And also if there is a better way of doing this in general (queue? dask? function logic?)
Thanks in advance!
I have a massive dataset that could use multicore processing.
I have a dataframe that has sequences and blocksize for each row.
I wrote a loop that extracts the sequence and block size for each row and calculates a score from a function from a package called localcider.
I can't figure out how to run it in parallel.
Can somebody help?
omega = []
AA=list('FYW')
for i, row in df.iterrows():
seq = df['IDRseq'][i]
b = df['bsize'][i]
bsize = [b-1,b]
SeqOb = SequenceParameters(seq,blobsize=bsize)
omega.append(SeqOb.get_kappa_X(AA))
s1 = pd.Series(omega, name='omega')
df = df.assign(omega=s1.values)
After a lot of googling, I came across pandarallel.
I think this is the most intuitive way of doing what I want.
I am posting the code for future reference.
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True, nb_workers = n)
# nb_workers = n ; I set the nb_workers fo CPU core - 1 so the system is more stable
def something(x):
#do stuff
return result
df['result'] = df.parallel_apply(something, axis=1)
I have to process a huge pandas.DataFrame (several tens of GB) on a row by row bases, where each row operation is quite lengthy (a couple of tens of milliseconds). So I had the idea to split up the frame into chunks and process each chunk in parallel using multiprocessing. This does speed-up the task, but the memory consumption is a nightmare.
Although each child process should in principle only consume a tiny chunk of the data, it needs (almost) as much memory as the original parent process that contained the original DataFrame. Even deleting the used parts in the parent process does not help.
I wrote a minimal example that replicates this behavior. The only thing it does is creating a large DataFrame with random numbers, chunk it into little pieces with at most 100 rows, and simply print some information about the DataFrame during multiprocessing (here via a mp.Pool of size 4).
The main function that is executed in parallel:
def just_wait_and_print_len_and_idx(df):
"""Waits for 5 seconds and prints df length and first and last index"""
# Extract some info
idx_values = df.index.values
first_idx, last_idx = idx_values[0], idx_values[-1]
length = len(df)
pid = os.getpid()
# Waste some CPU cycles
time.sleep(1)
# Print the info
print('First idx {}, last idx {} and len {} '
'from process {}'.format(first_idx, last_idx, length, pid))
The helper generator to chunk a DataFrame into little pieces:
def df_chunking(df, chunksize):
"""Splits df into chunks, drops data of original df inplace"""
count = 0 # Counter for chunks
while len(df):
count += 1
print('Preparing chunk {}'.format(count))
# Return df chunk
yield df.iloc[:chunksize].copy()
# Delete data in place because it is no longer needed
df.drop(df.index[:chunksize], inplace=True)
And the main routine:
def main():
# Job parameters
n_jobs = 4 # Poolsize
size = (10000, 1000) # Size of DataFrame
chunksize = 100 # Maximum size of Frame Chunk
# Preparation
df = pd.DataFrame(np.random.rand(*size))
pool = mp.Pool(n_jobs)
print('Starting MP')
# Execute the wait and print function in parallel
pool.imap(just_wait_and_print_len_and_idx, df_chunking(df, chunksize))
pool.close()
pool.join()
print('DONE')
The standard output looks like this:
Starting MP
Preparing chunk 1
Preparing chunk 2
First idx 0, last idx 99 and len 100 from process 9913
First idx 100, last idx 199 and len 100 from process 9914
Preparing chunk 3
First idx 200, last idx 299 and len 100 from process 9915
Preparing chunk 4
...
DONE
The Problem:
The main process needs about 120MB of memory. However, the child processes of the pool need the same amount of memory, although they only contain 1% of the original DataFame (chunks of size 100 vs original length of 10000). Why?
What can I do about it? Does Python (3) send the whole DataFrame to each child process despite my chunking? Is that a problem of pandas memory management or the fault of multiprocessing and data pickling? Thanks!
Whole script for simple copy and paste in case you want to try it yourself:
import multiprocessing as mp
import pandas as pd
import numpy as np
import time
import os
def just_wait_and_print_len_and_idx(df):
"""Waits for 5 seconds and prints df length and first and last index"""
# Extract some info
idx_values = df.index.values
first_idx, last_idx = idx_values[0], idx_values[-1]
length = len(df)
pid = os.getpid()
# Waste some CPU cycles
time.sleep(1)
# Print the info
print('First idx {}, last idx {} and len {} '
'from process {}'.format(first_idx, last_idx, length, pid))
def df_chunking(df, chunksize):
"""Splits df into chunks, drops data of original df inplace"""
count = 0 # Counter for chunks
while len(df):
count += 1
print('Preparing chunk {}'.format(count))
# Return df chunk
yield df.iloc[:chunksize].copy()
# Delete data in place because it is no longer needed
df.drop(df.index[:chunksize], inplace=True)
def main():
# Job parameters
n_jobs = 4 # Poolsize
size = (10000, 1000) # Size of DataFrame
chunksize = 100 # Maximum size of Frame Chunk
# Preparation
df = pd.DataFrame(np.random.rand(*size))
pool = mp.Pool(n_jobs)
print('Starting MP')
# Execute the wait and print function in parallel
pool.imap(just_wait_and_print_len_and_idx, df_chunking(df, chunksize))
pool.close()
pool.join()
print('DONE')
if __name__ == '__main__':
main()
Ok, so I figured it out after the hint by Sebastian Opałczyński in the comments.
The problem is that the child processes are forked from the parent, so all of them contain a reference to the original DataFrame. However, the frame is manipulated in the original process, so the copy-on-write behavior kills the whole thing slowly and eventually when the limit of the physical memory is reached.
There is a simple solution: Instead of pool = mp.Pool(n_jobs), I use the new context feature of multiprocessing:
ctx = mp.get_context('spawn')
pool = ctx.Pool(n_jobs)
This guarantees that the Pool processes are just spawned and not forked from the parent process. Accordingly, none of them has access to the original DataFrame and all of them only need a tiny fraction of the parent's memory.
Note that the mp.get_context('spawn') is only available in Python 3.4 and newer.
A better implementation is just to use the pandas implementation of chunked dataframe as a generator and feed it into the "pool.imap" function
pd.read_csv('<filepath>.csv', chucksize=<chunksize>)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Benefit: It doesn't read in the whole df in your main process (save memory). Each child process will be pointed the chunk it needs only. --> solve the child memory issue.
Overhead: It requires you to save your df as csv first and read it in again using pd.read_csv --> I/O time.
Note: chunksize is not available to pd.read_pickle or other loading methods that are compressed on storage.
def main():
# Job parameters
n_jobs = 4 # Poolsize
size = (10000, 1000) # Size of DataFrame
chunksize = 100 # Maximum size of Frame Chunk
# Preparation
df = pd.DataFrame(np.random.rand(*size))
pool = mp.Pool(n_jobs)
print('Starting MP')
# Execute the wait and print function in parallel
df_chunked = pd.read_csv('<filepath>.csv',chunksize = chunksize) # modified
pool.imap(just_wait_and_print_len_and_idx, df_chunking(df, df_chunked) # modified
pool.close()
pool.join()
print('DONE')
I'm trying to use multiprocessing with pandas dataframe, that is split the dataframe to 8 parts. apply some function to each part using apply (with each part processed in different process).
EDIT:
Here's the solution I finally found:
import multiprocessing as mp
import pandas.util.testing as pdt
def process_apply(x):
# do some stuff to data here
def process(df):
res = df.apply(process_apply, axis=1)
return res
if __name__ == '__main__':
p = mp.Pool(processes=8)
split_dfs = np.array_split(big_df,8)
pool_results = p.map(aoi_proc, split_dfs)
p.close()
p.join()
# merging parts processed by different processes
parts = pd.concat(pool_results, axis=0)
# merging newly calculated parts to big_df
big_df = pd.concat([big_df, parts], axis=1)
# checking if the dfs were merged correctly
pdt.assert_series_equal(parts['id'], big_df['id'])
You can use https://github.com/nalepae/pandarallel, as in the following example:
from pandarallel import pandarallel
from math import sin
pandarallel.initialize()
def func(x):
return sin(x**2)
df.parallel_apply(func, axis=1)
A more generic version based on the author solution, that allows to run it on every function and dataframe:
from multiprocessing import Pool
from functools import partial
import numpy as np
def parallelize(data, func, num_of_processes=8):
data_split = np.array_split(data, num_of_processes)
pool = Pool(num_of_processes)
data = pd.concat(pool.map(func, data_split))
pool.close()
pool.join()
return data
def run_on_subset(func, data_subset):
return data_subset.apply(func, axis=1)
def parallelize_on_rows(data, func, num_of_processes=8):
return parallelize(data, partial(run_on_subset, func), num_of_processes)
So the following line:
df.apply(some_func, axis=1)
Will become:
parallelize_on_rows(df, some_func)
This is some code that I found useful. Automatically splits the dataframe into however many cpu cores you have.
import pandas as pd
import numpy as np
import multiprocessing as mp
def parallelize_dataframe(df, func):
num_processes = mp.cpu_count()
df_split = np.array_split(df, num_processes)
with mp.Pool(num_processes) as p:
df = pd.concat(p.map(func, df_split))
return df
def parallelize_function(df):
df[column_output] = df[column_input].apply(example_function)
return df
def example_function(x):
x = x*2
return x
To run:
df_output = parallelize_dataframe(df, parallelize_function)
This worked well for me:
rows_iter = (row for _, row in df.iterrows())
with multiprocessing.Pool() as pool:
df['new_column'] = pool.map(process_apply, rows_iter)
Since I don't have much of your data script, this is a guess, but I'd suggest using p.map instead of apply_async with the callback.
p = mp.Pool(8)
pool_results = p.map(process, np.array_split(big_df,8))
p.close()
p.join()
results = []
for result in pool_results:
results.extend(result)
To use all (physical or logical) cores, you could try mapply as an alternative to swifter and pandarallel.
You can set the amount of cores (and the chunking behaviour) upon init:
import pandas as pd
import mapply
mapply.init(n_workers=-1)
def process_apply(x):
# do some stuff to data here
def process(df):
# spawns a pathos.multiprocessing.ProcessPool if sensible
res = df.mapply(process_apply, axis=1)
return res
By default (n_workers=-1), the package uses all physical CPUs available on the system. If your system uses hyper-threading (usually twice the amount of physical CPUs would show up), mapply will spawn one extra worker to prioritise the multiprocessing pool over other processes on the system.
You could also use all logical cores instead (beware that like this the CPU-bound processes will be fighting for physical CPUs, which might slow down your operation):
import multiprocessing
n_workers = multiprocessing.cpu_count()
# or more explicit
import psutil
n_workers = psutil.cpu_count(logical=True)
I also run into the same problem when I use multiprocessing.map() to apply function to different chunk of a large dataframe.
I just want to add several points just in case other people run into the same problem as I do.
remember to add if __name__ == '__main__':
execute the file in a .py file, if you use ipython/jupyter notebook, then you can not run multiprocessing (this is true for my case, though I have no clue)
Install Pyxtension that simplifies using parallel map and use like this:
from pyxtension.streams import stream
big_df = pd.concat(stream(np.array_split(df, multiprocessing.cpu_count())).mpmap(process))
I ended up using concurrent.futures.ProcessPoolExecutor.map in place of multiprocessing.Pool.map which took 316 microseconds for some code that took 12 seconds in serial.
Python's pool.starmap() method can be used to succinctly introduce parallelism also to apply use cases where column values are passed as arguments, i.e. to cases like:
df.apply(lambda row: my_func(row["col_1"], row["col_2"], ...), axis=1)
Full example and benchmarking:
import time
from multiprocessing import Pool
import numpy as np
import pandas as pd
def mul(a, b, c):
# For illustration, could obviously be vectorized
return a * b * c
df = pd.DataFrame(np.random.randint(0, 100, size=(10_000_000, 3)), columns=list('ABC'))
# Standard apply
start = time.time()
df["mul"] = df.apply(lambda row: mul(row["A"], row["B"], row["C"]), axis=1)
print(f"Standard apply took {time.time() - start:.0f} seconds.")
# Starmap apply
start = time.time()
with Pool(10) as pool:
df["mul_pool"] = pool.starmap(mul, zip(df["A"], df["B"], df["C"]))
print(f"Starmap apply took {time.time() - start:.0f} seconds.")
pd.testing.assert_series_equal(df["mul"], df["mul_pool"], check_names=False)
>>> Standard apply took 72 seconds.
>>> Starmap apply took 5 seconds.
This has the benefit of not relying on external libraries, plus being very readable.
Tom Raz's answer https://stackoverflow.com/a/53135031/11847090 misses an edge case where there are fewer rows in the dataframe than processes
use this parallelize method instead
def parallelize(data, func, num_of_processes=8):
# check if the number of rows is less than the number of processes
# to avoid the following error
# ValueError: Expected a 1D array, got an array with shape
num_rows = len(data)
if num_rows == 0:
return None
elif num_rows < num_of_processes:
num_of_processes = num_rows
data_split = np.array_split(data, num_of_processes)
pool = Pool(num_of_processes)
data = pd.concat(pool.map(func, data_split))
pool.close()
pool.join()
return data
and also I used dask bag to multithread this instead of this custom code