I'm using the multiprocessing python library to run in parallel feature selection for a machine learning problem. This function accepts as input a pandas dataframe and returns some figures.
When I execute this function using mp.pool.map() everything runs smoothly. However, if I substitute it with mp.pool.ThreadPool.map() it fails with this error:
AssertionError: Number of manager items must equal union of block items
# manager items: 15, # tot_items: 20.
Strangely, I was running the ThreadPool code normally till yesterday. Then, I tried to re-run it and started getting these errors. I need ThreadPool since this is an IO bound job and it was running much faster compared to pool.
EDIT:
The code goes like that (python 2.7):
import multiprocessing as mp
import pandas as pd (version 0.22.0)
def main_functionality(df, params):
df = df[params['feature']]
#Run 5-fold cross-validation
data_df = pd.DataFrame(....)
pred_df = pred_df.append(data_df)
return statistics from pred_df
def a_function(df_init, feature, params_init):
params = dict(params_init)
df = df_init.copy()
params['feature'] = feature
try:
results = main_functionality(df, params)
except:
results = (0,0,0)
return results
def b_function(df, features):
pool = mp.pool.ThreadPool(4)
params = {...}
results = pool.map(a_function,(df, feature, params) for f in features))
results_df = pd.DataFrame(results)
results_df.to_csv(...)
if __name__ == '__main__':
df = read.csv(...) # A big CSV file (i.e. few GBs)
features = [i for i in df.columns if i ....]
b_function(df, features)
Related
I'm reading data from databases. I need to read from several servers (nodes) simultaniuosly, so I want to use pool.map.
I'm trying to do this way:
import pathos.pools as pp
import pandas as pd
import urllib
class DataProvider():
def __init__(self, hosts):
self.hosts_read = hosts
def read_data(self, host_index):
'''
Read data from current node
'''
limit = 1000000
host = self.hosts_read[host_index]
query = f"select FIELD1 from table_name limit {limit}"
url = urllib.parse.urlencode({'query': query})
df = pd.io.parsers.read_csv(f'http://{host}:8123/?{url}',
sep="\t", names=['FIELD1'], low_memory=False)
return df
def pool_read(self, num_workers):
'''
Read from data using Pool of workers.
Return list of lists - every list is a data from current worker.
'''
pool = pp.ProcessPool(num_workers)
result = pool.map(self.read_data, range(len(self.hosts_read)))
return result
if __name__ == '__main__':
provider = DataProvider(host=['server01.com', 'server02.com'])
data = provider.pool_read(num_workers=n_cpu)
It works perfect while limit is not so much (below 4 millions). And crushes if it is bigger:
multiprocess.pool.MaybeEncodingError: Error sending result:
'[my_pandas_dataframe]'. Reason: 'error("'i' format requires
-2147483648 <= number <= 2147483647")'
I found some answers about it: it is because we cannot return from the pool peace of data bigger than 2 GB. For example: SO link. But there is no any ideas or solutions, how to work if I need load bigger parts!
P.S. I use pathos module but it is not important here - the same error for multiprocessing module too.
I am trying to increase the speed of my program in Python using multiprocessing, but it doesn't actually create any more processes. I've watched a few tutorials but I'm not getting anywhere.
Here it is:
cpuutil = int((multiprocessing.cpu_count()) / 2)
p = Pool(processes = cpuutil)
output = p.map(OSGBtoETRSfunc(data, eastcol, northcol))
p.close()
p.join()
return output
So to me, this should create 2 processes on a quadcore machine, but it doesn't. My CPU util sits around 18%...
Any insight? It looks the same as the tutorials I have watched... The p.map was not working when listing arguments in square brackets ([]) so I presumed it would need to be in the syntax it is above?
Thanks
I don't clearly understand what do you want, so let's start from simple. The following is a way to simply call the same function over the rows of pd dataframe:
import pandas as pd
import numpy as np
import os
import pathos
from contextlib import closing
NUM_PROCESSES = os.cpu_count()
# create some data frame 100x4
nrow = 100
ncol = 4
df = pd.DataFrame(np.random.randint(0,100,size=(nrow, ncol)), columns=list('ABCD'))
# dataframe resides in global scope
# so it is accessible to processes spawned below
# I pass only row indices to each process
# function to be run over rows
# it transforms the given row independently
def foo(idx):
# extract given row to numpy
row = df.iloc[[idx]].values[0]
# you can pass ranges:
# df[2:3]
# transform row
# I return it as list for simplicity of creating dataframe
row = np.exp(row)
# return numpy row
return row
# run pool over range of indexes (0,1, ... , nrow-1)
# and close it afterwars
# there is not reason here to have more workers than number of CPUs
with closing(pathos.multiprocessing.Pool(processes=NUM_PROCESSES)) as pool:
results = pool.map(foo, range(nrow))
# create new dataframe from all those numpy slices:
col_names = df.columns.values.tolist()
df_new = pd.DataFrame(np.array(results), columns=col_names)
What in your computation needs more complicated setup?
EDIT: Ok, here is running two functions concurrently (I am not much familiar with pandas, so just switch to numpy):
# RUNNING TWO FUNCTIONS SIMLTANEOUSLY
import pandas as pd
import numpy as np
from multiprocessing import Process, Queue
# create some data frame 100x4
nrow = 100
ncol = 4
df = pd.DataFrame(np.random.randint(0,100,size=(nrow, ncol)), columns=list('ABCD'))
# dataframe resides in global scope
# so it is accessible to processes spawned below
# I pass only row indices to each process
# function to be run over part1 independently
def proc_func1(q1):
# get data from queue1
data1 = q1.get()
# I extract given data to numpy
data_numpy = data1.values
# do something
data_numpy_new = data_numpy + 1
# return numpy array to queue 1
q1.put(data_numpy_new)
return
# function to be run over part2 independently
def proc_func2(q2):
# get data from queue2
data2 = q2.get()
# I extract given data to numpy
data_numpy = data2.values
# do something
data_numpy_new = data_numpy - 1
# return numpy array to queue 2
q2.put(data_numpy_new)
return
# instantiate queues
q1 = Queue()
q2 = Queue()
# divide data frame into two parts
part1 = df[:50]
part2 = df[50:]
# send data, so it will already be in queries
q1.put(part1)
q2.put(part2)
# start two processes
p1 = Process(target=proc_func1, args=(q1,))
p2 = Process(target=proc_func2, args=(q2,))
p1.start()
p2.start()
# wait until they finish
p1.join()
p2.join()
# read results from Queues
res1 = q1.get()
res2 = q2.get()
if (res1 is None) or (res2 is None):
print('Error!')
# reassemble two results back to single dataframe (might be inefficient)
col_names = df.columns.values.tolist()
# concatenate results along x axis
df_new = pd.DataFrame(np.concatenate([np.array(res1), np.array(res2)], axis=0), columns=col_names)
In Python you should provide the function and the arguments separated. If not, you are executing the function OSGBtoETRSfunc at the time of creating the process. Instead, you should provide the pointer to the function, and a list with the arguments.
Your case is similar to the one shown on Python Docs: https://docs.python.org/3.7/library/multiprocessing.html#introduction
Anyway, I think you are using the wrong function. Pool.map() works as map: on a list of items and applies the same function to each item. I think that your function OSGBtoERTSfunc needs the three params in order to work properly. Please, instead of using p.map(), use p.apply()
cpuutil = int((multiprocessing.cpu_count()) / 2)
p = Pool(processes = cpuutil)
output = p.apply(OSGBtoETRSfunc, [data, eastcol, northcol])
p.close()
p.join()
return output
with a basic pandas df of financial market OHLCV data, I am trying to add numerous calculated columns to the df. The large number of columns and calculations is making this SLOW SLOW SLOW!
Trying to multiprocess with pool.map, but getting nowhere.
Ideally, each iteration of the loop should be sent to a discrete thread. Simplified moving averages in code below.
Shown simple dictionary and rolling mean works SLOWLY
TypeError: map() missing 1 required positional argument: 'iterable'
All help appreciated-thx
import pandas as pd
from multiprocessing.dummy import Pool as ThreadPool
#####################################################
# DJIA_OHLCV_test.csv has format:
# Date,Open,High,Low,Close,Adj Close,Volume
#
1/2/2015,17823.07031,17951.7793,17731.30078,17832.99023,17832.99023,76270000
#
1/3/2015,17823.07031,17951.7793,17731.30078,17832.99023,17832.99023,76270000
DJIA = pd.read_csv('DJIA_OHLCV_test.csv')
"""
#####################################################
# # This works! please comment out to switch
# MAdict = {'MA50':50, 'MA100':100, 'MA200':200} # Define Moving Average
Windows
# for MAkey in MAdict:
# DJIA[('ma' + MAkey)] = pd.Series.rolling(DJIA['Adj Close'],
window=MAdict[MAkey]).mean()
#####################################################
"""
# This doesn't work! please comment out to switch
MAdict = {'MA50':50, 'MA100':100, 'MA200':200}
pool = ThreadPool(3)
def moving_average(MAkey):
return pd.Series.rolling(DJIA['Adj Close'], window=MAdict[MAkey]).mean()
for MAkey in MAdict:
DJIA[('ma' + MAkey)] = pool.map(moving_average(MAkey))
#####################################################
print(DJIA.tail())
pool.map is a blocking call, so instead of iterating over MAdict and calling pool.map you need to pass the iterable directly as an argument to pool.map:
import pandas as pd
from multiprocessing.dummy import Pool
def moving_average(ma):
return pd.Series.rolling(djia['Adj Close'], window=ma).mean()
if __name__ == '__main__':
N_WORKERS = 3
MA_DICT = {'MA50':50, 'MA100':100, 'MA200':200}
djia = pd.read_csv('DJIA_OHLCV_test.csv')
with Pool(N_WORKERS) as pool:
results = pool.map(moving_average, iterable=MA_DICT.values())
# concatenate results and rename columns
results = pd.concat(results, axis=1)
results.columns = ['ma' + key for key in MA_DICT]
djia = pd.concat([djia, results], axis=1)
print(djia.tail())
I am working on a task where I need to determine where two geospatial points are within 250 meters of each other and occur within 20 minutes of each other. My data set is approximately 1.2M rows and 10 columns. So, I need to determine a distance, time difference, and whether they meet my criteria by going through 1.2M**2 calculations.
I have been able to run the code below where I create 10,000 Dask objects to compute without problem. However, when I attempt to test 100,000 objects Dask runs up against memory limitations and I see significant CPU usage for swap. To be clear, I'm running this on a 32 core node with 125 GB of memory.
Admittedly, I'm quite new to Dask, so I'd like to know: is there a better way to solve this problem than processing in 10,000 row chunks?
#!/usr/bin/env python
import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask.array import sqrt
import time
import multiprocessing as mp
df = pd.read_hdf(...) # Used to select single item for comparison
ddf = dd.read_hdf(...) # Used for Dask operations
def distCheck(item,df=ddf):
'''
Determine if any records in df are within 250m of item and within 20
minutes of item. Return Dask object for calculation.
'''
dist = sqrt(((ddf.LCC_x1-item.LCC_x1)**2+(ddf.LCC_y1-item.LCC_y1)**2))
distcrit = dist[dist < 250]
delta = (ddf.Date - item.Date).abs()
timecrit = delta[delta < np.timedelta64(20,'m')]
res1 = ddf.copy()
res1['dist'] = dist
res1['delta'] = delta
res1 = res1.loc[(distcrit.index) & (timecrit.index) & (idcrit.index)]
res1['MatchMMSI'] = item.MMSI
res1['MatchVoy'] = item.Voyage
out = res1
return out
def getDaskCalls(start,stop):
'''
Get Dask objects to assess temporal and spatial proximity for df
indices from start to stop.
'''
# Kick off multiprocessing pool, submit, and close
pool = mp.Pool(processes=32)
daskers = []
for i in range(start,stop):
result = pool.apply_async(distCheck,args=(df.iloc[i,:],ddf,))
daskers.append(result)
dasky = [i.get() for i in daskers]
pool.close()
return dasky
def runDask(calls):
result = pd.DataFrame([],columns=calls[0].columns)
output = dd.compute(calls)
result = pd.concat([result]+[i for i in output[0] if i.shape[0] != 0])
return result
###
### Process
###
# Get initial timestamp
start = time.time()
# Create Dask Calls & determine duration
dcalls = getDaskCalls(0,10000)
callsCreated = time.time()
# Print time required to create calls
print("Dask Calls Created.")
print(callsCreated-start)
# Compute the calls with Dask
print("Computing...")
result = runDask(dcalls)
# Print the time for computation
computation = time.time()
print(" ...Done.")
print(computation-callsCreated)
I'm trying to use multiprocessing with pandas dataframe, that is split the dataframe to 8 parts. apply some function to each part using apply (with each part processed in different process).
EDIT:
Here's the solution I finally found:
import multiprocessing as mp
import pandas.util.testing as pdt
def process_apply(x):
# do some stuff to data here
def process(df):
res = df.apply(process_apply, axis=1)
return res
if __name__ == '__main__':
p = mp.Pool(processes=8)
split_dfs = np.array_split(big_df,8)
pool_results = p.map(aoi_proc, split_dfs)
p.close()
p.join()
# merging parts processed by different processes
parts = pd.concat(pool_results, axis=0)
# merging newly calculated parts to big_df
big_df = pd.concat([big_df, parts], axis=1)
# checking if the dfs were merged correctly
pdt.assert_series_equal(parts['id'], big_df['id'])
You can use https://github.com/nalepae/pandarallel, as in the following example:
from pandarallel import pandarallel
from math import sin
pandarallel.initialize()
def func(x):
return sin(x**2)
df.parallel_apply(func, axis=1)
A more generic version based on the author solution, that allows to run it on every function and dataframe:
from multiprocessing import Pool
from functools import partial
import numpy as np
def parallelize(data, func, num_of_processes=8):
data_split = np.array_split(data, num_of_processes)
pool = Pool(num_of_processes)
data = pd.concat(pool.map(func, data_split))
pool.close()
pool.join()
return data
def run_on_subset(func, data_subset):
return data_subset.apply(func, axis=1)
def parallelize_on_rows(data, func, num_of_processes=8):
return parallelize(data, partial(run_on_subset, func), num_of_processes)
So the following line:
df.apply(some_func, axis=1)
Will become:
parallelize_on_rows(df, some_func)
This is some code that I found useful. Automatically splits the dataframe into however many cpu cores you have.
import pandas as pd
import numpy as np
import multiprocessing as mp
def parallelize_dataframe(df, func):
num_processes = mp.cpu_count()
df_split = np.array_split(df, num_processes)
with mp.Pool(num_processes) as p:
df = pd.concat(p.map(func, df_split))
return df
def parallelize_function(df):
df[column_output] = df[column_input].apply(example_function)
return df
def example_function(x):
x = x*2
return x
To run:
df_output = parallelize_dataframe(df, parallelize_function)
This worked well for me:
rows_iter = (row for _, row in df.iterrows())
with multiprocessing.Pool() as pool:
df['new_column'] = pool.map(process_apply, rows_iter)
Since I don't have much of your data script, this is a guess, but I'd suggest using p.map instead of apply_async with the callback.
p = mp.Pool(8)
pool_results = p.map(process, np.array_split(big_df,8))
p.close()
p.join()
results = []
for result in pool_results:
results.extend(result)
To use all (physical or logical) cores, you could try mapply as an alternative to swifter and pandarallel.
You can set the amount of cores (and the chunking behaviour) upon init:
import pandas as pd
import mapply
mapply.init(n_workers=-1)
def process_apply(x):
# do some stuff to data here
def process(df):
# spawns a pathos.multiprocessing.ProcessPool if sensible
res = df.mapply(process_apply, axis=1)
return res
By default (n_workers=-1), the package uses all physical CPUs available on the system. If your system uses hyper-threading (usually twice the amount of physical CPUs would show up), mapply will spawn one extra worker to prioritise the multiprocessing pool over other processes on the system.
You could also use all logical cores instead (beware that like this the CPU-bound processes will be fighting for physical CPUs, which might slow down your operation):
import multiprocessing
n_workers = multiprocessing.cpu_count()
# or more explicit
import psutil
n_workers = psutil.cpu_count(logical=True)
I also run into the same problem when I use multiprocessing.map() to apply function to different chunk of a large dataframe.
I just want to add several points just in case other people run into the same problem as I do.
remember to add if __name__ == '__main__':
execute the file in a .py file, if you use ipython/jupyter notebook, then you can not run multiprocessing (this is true for my case, though I have no clue)
Install Pyxtension that simplifies using parallel map and use like this:
from pyxtension.streams import stream
big_df = pd.concat(stream(np.array_split(df, multiprocessing.cpu_count())).mpmap(process))
I ended up using concurrent.futures.ProcessPoolExecutor.map in place of multiprocessing.Pool.map which took 316 microseconds for some code that took 12 seconds in serial.
Python's pool.starmap() method can be used to succinctly introduce parallelism also to apply use cases where column values are passed as arguments, i.e. to cases like:
df.apply(lambda row: my_func(row["col_1"], row["col_2"], ...), axis=1)
Full example and benchmarking:
import time
from multiprocessing import Pool
import numpy as np
import pandas as pd
def mul(a, b, c):
# For illustration, could obviously be vectorized
return a * b * c
df = pd.DataFrame(np.random.randint(0, 100, size=(10_000_000, 3)), columns=list('ABC'))
# Standard apply
start = time.time()
df["mul"] = df.apply(lambda row: mul(row["A"], row["B"], row["C"]), axis=1)
print(f"Standard apply took {time.time() - start:.0f} seconds.")
# Starmap apply
start = time.time()
with Pool(10) as pool:
df["mul_pool"] = pool.starmap(mul, zip(df["A"], df["B"], df["C"]))
print(f"Starmap apply took {time.time() - start:.0f} seconds.")
pd.testing.assert_series_equal(df["mul"], df["mul_pool"], check_names=False)
>>> Standard apply took 72 seconds.
>>> Starmap apply took 5 seconds.
This has the benefit of not relying on external libraries, plus being very readable.
Tom Raz's answer https://stackoverflow.com/a/53135031/11847090 misses an edge case where there are fewer rows in the dataframe than processes
use this parallelize method instead
def parallelize(data, func, num_of_processes=8):
# check if the number of rows is less than the number of processes
# to avoid the following error
# ValueError: Expected a 1D array, got an array with shape
num_rows = len(data)
if num_rows == 0:
return None
elif num_rows < num_of_processes:
num_of_processes = num_rows
data_split = np.array_split(data, num_of_processes)
pool = Pool(num_of_processes)
data = pd.concat(pool.map(func, data_split))
pool.close()
pool.join()
return data
and also I used dask bag to multithread this instead of this custom code