Multiprocessing Doesn't Create Any Extra Processes - python

I am trying to increase the speed of my program in Python using multiprocessing, but it doesn't actually create any more processes. I've watched a few tutorials but I'm not getting anywhere.
Here it is:
cpuutil = int((multiprocessing.cpu_count()) / 2)
p = Pool(processes = cpuutil)
output = p.map(OSGBtoETRSfunc(data, eastcol, northcol))
p.close()
p.join()
return output
So to me, this should create 2 processes on a quadcore machine, but it doesn't. My CPU util sits around 18%...
Any insight? It looks the same as the tutorials I have watched... p.map was not working when I listed the arguments in square brackets ([]), so I presumed it needed to be in the syntax above?
Thanks

I don't clearly understand what you want, so let's start simple. The following is a way to simply call the same function over the rows of a pd dataframe:
import pandas as pd
import numpy as np
import os
import pathos
from contextlib import closing

NUM_PROCESSES = os.cpu_count()

# create some data frame 100x4
nrow = 100
ncol = 4
df = pd.DataFrame(np.random.randint(0, 100, size=(nrow, ncol)), columns=list('ABCD'))

# dataframe resides in global scope
# so it is accessible to processes spawned below
# I pass only row indices to each process

# function to be run over rows
# it transforms the given row independently
def foo(idx):
    # extract given row to numpy
    row = df.iloc[[idx]].values[0]
    # you can pass ranges:
    # df[2:3]
    # transform row
    # I return it as a list for simplicity of creating the dataframe
    row = np.exp(row)
    # return numpy row
    return row

# run pool over range of indexes (0, 1, ..., nrow-1)
# and close it afterwards
# there is no reason here to have more workers than the number of CPUs
with closing(pathos.multiprocessing.Pool(processes=NUM_PROCESSES)) as pool:
    results = pool.map(foo, range(nrow))

# create new dataframe from all those numpy slices:
col_names = df.columns.values.tolist()
df_new = pd.DataFrame(np.array(results), columns=col_names)
What in your computation needs a more complicated setup?
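If you prefer to stay with the standard library instead of pathos, a roughly equivalent sketch is below. The seeded RNG is my own addition so that workers which re-import the module (under the spawn start method) build the same global dataframe:
import os
import numpy as np
import pandas as pd
from multiprocessing import Pool

np.random.seed(0)  # so workers that re-import this module build the same df
nrow, ncol = 100, 4
df = pd.DataFrame(np.random.randint(0, 100, size=(nrow, ncol)), columns=list('ABCD'))

def foo(idx):
    # same row-wise transform as above
    return np.exp(df.iloc[[idx]].values[0])

if __name__ == '__main__':
    with Pool(processes=os.cpu_count()) as pool:
        results = pool.map(foo, range(nrow))
    df_new = pd.DataFrame(np.array(results), columns=df.columns)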
EDIT: OK, here is how to run two functions concurrently (I am not very familiar with pandas, so I just switch to numpy):
# RUNNING TWO FUNCTIONS SIMULTANEOUSLY

import pandas as pd
import numpy as np
from multiprocessing import Process, Queue

# create some data frame 100x4
nrow = 100
ncol = 4
df = pd.DataFrame(np.random.randint(0, 100, size=(nrow, ncol)), columns=list('ABCD'))

# dataframe resides in global scope
# so it is accessible to processes spawned below
# I pass only row indices to each process

# function to be run over part1 independently
def proc_func1(q1):
    # get data from queue1
    data1 = q1.get()
    # I extract given data to numpy
    data_numpy = data1.values
    # do something
    data_numpy_new = data_numpy + 1
    # return numpy array to queue 1
    q1.put(data_numpy_new)
    return

# function to be run over part2 independently
def proc_func2(q2):
    # get data from queue2
    data2 = q2.get()
    # I extract given data to numpy
    data_numpy = data2.values
    # do something
    data_numpy_new = data_numpy - 1
    # return numpy array to queue 2
    q2.put(data_numpy_new)
    return

# instantiate queues
q1 = Queue()
q2 = Queue()

# divide data frame into two parts
part1 = df[:50]
part2 = df[50:]

# send data, so it will already be in the queues
q1.put(part1)
q2.put(part2)

# start two processes
p1 = Process(target=proc_func1, args=(q1,))
p2 = Process(target=proc_func2, args=(q2,))
p1.start()
p2.start()

# wait until they finish
p1.join()
p2.join()

# read results from the queues
res1 = q1.get()
res2 = q2.get()
if (res1 is None) or (res2 is None):
    print('Error!')

# reassemble the two results back into a single dataframe (might be inefficient)
col_names = df.columns.values.tolist()
# concatenate results along the row axis
df_new = pd.DataFrame(np.concatenate([np.array(res1), np.array(res2)], axis=0), columns=col_names)
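For completeness, the same two-way split can also be expressed with a Pool, which avoids the manual queue plumbing. This is only a sketch under the assumption that the two transforms can be ordinary top-level functions that return their results:
import numpy as np
import pandas as pd
from multiprocessing import Pool

def func1(data):
    return data.values + 1

def func2(data):
    return data.values - 1

def dispatch(args):
    # unpack (function, data) pairs so a single pool.map can run both jobs
    func, data = args
    return func(data)

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
    part1, part2 = df[:50], df[50:]
    with Pool(2) as pool:
        res1, res2 = pool.map(dispatch, [(func1, part1), (func2, part2)])
    df_new = pd.DataFrame(np.concatenate([res1, res2], axis=0), columns=df.columns)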

In Python you should provide the function and the arguments separately. If not, you are executing the function OSGBtoETRSfunc at the time of creating the process. Instead, you should provide a reference to the function and a list with the arguments.
Your case is similar to the one shown on Python Docs: https://docs.python.org/3.7/library/multiprocessing.html#introduction
Anyway, I think you are using the wrong function. Pool.map() works like map: it takes a list of items and applies the same function to each item. I think your function OSGBtoETRSfunc needs all three params in order to work properly, so instead of using p.map(), use p.apply():
cpuutil = int((multiprocessing.cpu_count()) / 2)
p = Pool(processes = cpuutil)
output = p.apply(OSGBtoETRSfunc, [data, eastcol, northcol])
p.close()
p.join()
return output
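If, on the other hand, the goal is to keep both workers busy at once, the usual pattern is to split data into chunks. This is only a sketch: it assumes OSGBtoETRSfunc can operate on a row-wise chunk of data (with data, eastcol, northcol as in your question) and that its outputs can be concatenated afterwards:
# hedged sketch: assumes OSGBtoETRSfunc(data, eastcol, northcol) works on
# any row-wise chunk of data and that the partial results can be combined
import multiprocessing
from multiprocessing import Pool
import numpy as np

cpuutil = int(multiprocessing.cpu_count() / 2)
chunks = np.array_split(data, cpuutil)

with Pool(processes=cpuutil) as p:
    partial_outputs = p.starmap(OSGBtoETRSfunc,
                                [(chunk, eastcol, northcol) for chunk in chunks])

output = np.concatenate(partial_outputs)  # or pd.concat, depending on the return type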

Related

Why is appending to a list slower when using multiprocessing?

I want to append to a list. Each element to be appended is a large dataframe.
I am trying to use the multiprocessing module to speed up appending to the list. My code is as follows:
import pandas as pd
import numpy as np
import time
import multiprocessing
from multiprocessing import Manager

def generate_df(size):
    df = pd.DataFrame()
    for x in list('abcdefghi'):
        df[x] = np.random.normal(size=size)
    return df

def do_something(df_list, size, k):
    df = generate_df(size)
    df_list[k] = df

if __name__ == '__main__':
    size = 200000
    num_df = 30
    start = time.perf_counter()
    with Manager() as manager:
        df_list = manager.list(range(num_df))
        processes = []
        for k in range(num_df):
            p = multiprocessing.Process(target=do_something, args=(df_list, size, k,))
            p.start()
            processes.append(p)
        for process in processes:
            process.join()
        final_df = pd.concat(df_list)
        print(final_df.head())
    finish = time.perf_counter()
    print(f'Finished in {round(finish-start,2)} second(s)')
    print(len(final_df))
The elapsed time is 7 seconds.
I tried appending to the list without multiprocessing:
df_list = []
for _ in range(num_df):
    df_list.append(generate_df(size))
final_df = pd.concat(df_list)
But this time the elapsed time is 2 seconds! Why is appending to the list with multiprocessing slower than without it?
When you use manager.list, you're not using a normal Python list. You're using a special list proxy object that has a whole lot of other stuff going on. Every operation on that list will involve locking and interprocess communication so that every process with access to the list will see the same data in it at all times. It's slow because it's a non-trivial problem to keep everything consistent in that way.
You probably don't need all of that synchronization, and it's just slowing you down. A much more natural way to do what you're attempting is to use a process pool and its map method. The pool will handle creating and shutting down the processes, and map will call a target function with an argument from an iterable.
Try something like this, which will use a number of worker processes equal to the number of CPUs your system has:
if __name__ == '__main__':
    size = 200000
    num_df = 30
    start = time.perf_counter()

    with multiprocessing.Pool() as pool:
        df_list = pool.map(generate_df, [size]*num_df)

    final_df = pd.concat(df_list)
    print(final_df.head())

    finish = time.perf_counter()
    print(f'Finished in {round(finish-start,2)} second(s)')
    print(len(final_df))
This will still have some overhead, since the interprocess communication used to pass the dataframes back to the main process is not free. It may still be slower than running everything in a single process.
Two points:
Starting subprocesses and retrieving data from them has a cost: the data must be transported between processes. This means that if the transport time is greater than the time it takes to compute the data, you see no benefit. This article explains the question in more detail.
In your implementation the bottleneck is the use of df_list. The Manager uses locks, which means the processes are not free to write their results into the list df_list concurrently.
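A quick way to see the first point is to compare the compute time of one dataframe with the cost of a pickle round-trip, which roughly approximates what it costs to ship that dataframe between processes. This is only a rough sketch, reusing generate_df from the question above:
# rough sketch: compute time vs. (de)serialization time, the latter
# approximating the per-dataframe transport cost between processes
import pickle
import time

start = time.perf_counter()
df = generate_df(200000)
print(f"compute: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
blob = pickle.dumps(df)
_ = pickle.loads(blob)
print(f"pickle round-trip: {time.perf_counter() - start:.3f}s")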

How to increase the speed of a function being called 400 times in Python

I have a list named dfs. It contains 400 Pandas dataframes of size 700 rows x 400 columns.
I have a function like this:
def updateDataframe(i):
    global dfs
    df = dfs[i]
    df["abc"].iloc[-1] = "xyz"
    df["abc2"] = df["abc"].rolling(10).mean()
    # ... more pandas operations like this
    dfs[i] = df

for i in range(len(dfs)):
    updateDataframe(i)
Now, this loop takes 10 seconds to execute. I have tried Python multiprocessing, but it takes the same time and sometimes even more.
Things I tried:
import multiprocessing.dummy as mp  # thread-based drop-in for multiprocessing

p = mp.Pool(8)  # define number of workers to use
p.map(updateDataframe, range(len(dfs)))  # apply updateDataframe to each index
p.close()  # close the pool
p.join()
Also tried this:
from multiprocessing import Process

if __name__ == "__main__":  # confirms that the code is under the main guard
    processes = []
    for i in range(len(dfs)):
        process = Process(target=updateDataframe, args=(i,))
        processes.append(process)
        process.start()
    # complete the processes
    for i in range(len(processes)):
        processes[i].join()
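Neither attempt can pay off as written: the thread-based pool (multiprocessing.dummy) shares dfs but is largely serialized by the GIL for CPU-bound pandas work, while the Process version mutates a copy of dfs that the parent never sees. A hedged sketch of the usual alternative, where workers return the updated frames instead of mutating a global (the example data and the simplified update function are hypothetical stand-ins):
# hedged sketch, not the original code: the update function takes and
# returns a dataframe so results survive the process boundary
import multiprocessing as mp
import numpy as np
import pandas as pd

def update_dataframe(df):
    df = df.copy()
    # stand-in for the pandas operations from the question
    df["abc2"] = df["abc"].rolling(10).mean()
    return df

if __name__ == "__main__":
    # hypothetical stand-in for the 400 dataframes described above
    dfs = [pd.DataFrame({"abc": np.random.rand(700)}) for _ in range(400)]
    with mp.Pool() as pool:
        dfs = pool.map(update_dataframe, dfs)
    print(dfs[0].head())
Whether this beats the 10-second loop depends on whether the per-frame work outweighs the cost of pickling 400 frames to and from the workers.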

sending each looped pandas calculation to a different thread (python3.6.5) with pool.map

With a basic pandas df of financial market OHLCV data, I am trying to add numerous calculated columns to the df. The large number of columns and calculations is making this SLOW SLOW SLOW!
Trying to multiprocess with pool.map, but getting nowhere.
Ideally, each iteration of the loop should be sent to a discrete thread. Simplified moving averages are in the code below.
The simple dictionary and rolling-mean version shown works, but slowly; the pool.map version raises:
TypeError: map() missing 1 required positional argument: 'iterable'
All help appreciated-thx
import pandas as pd
from multiprocessing.dummy import Pool as ThreadPool

#####################################################
# DJIA_OHLCV_test.csv has format:
# Date,Open,High,Low,Close,Adj Close,Volume
# 1/2/2015,17823.07031,17951.7793,17731.30078,17832.99023,17832.99023,76270000
# 1/3/2015,17823.07031,17951.7793,17731.30078,17832.99023,17832.99023,76270000

DJIA = pd.read_csv('DJIA_OHLCV_test.csv')

"""
#####################################################
# # This works! please comment out to switch
# MAdict = {'MA50':50, 'MA100':100, 'MA200':200} # Define Moving Average Windows
# for MAkey in MAdict:
#     DJIA[('ma' + MAkey)] = pd.Series.rolling(DJIA['Adj Close'], window=MAdict[MAkey]).mean()
#####################################################
"""

# This doesn't work! please comment out to switch
MAdict = {'MA50':50, 'MA100':100, 'MA200':200}
pool = ThreadPool(3)

def moving_average(MAkey):
    return pd.Series.rolling(DJIA['Adj Close'], window=MAdict[MAkey]).mean()

for MAkey in MAdict:
    DJIA[('ma' + MAkey)] = pool.map(moving_average(MAkey))

#####################################################
print(DJIA.tail())
pool.map expects both a function and an iterable (hence the TypeError about the missing 'iterable' argument), and it blocks until all results are ready. Instead of iterating over MAdict and calling the function yourself, pass the iterable directly as an argument to pool.map:
import pandas as pd
from multiprocessing.dummy import Pool

def moving_average(ma):
    return pd.Series.rolling(djia['Adj Close'], window=ma).mean()

if __name__ == '__main__':
    N_WORKERS = 3
    MA_DICT = {'MA50':50, 'MA100':100, 'MA200':200}

    djia = pd.read_csv('DJIA_OHLCV_test.csv')

    with Pool(N_WORKERS) as pool:
        results = pool.map(moving_average, iterable=MA_DICT.values())

    # concatenate results and rename columns
    results = pd.concat(results, axis=1)
    results.columns = ['ma' + key for key in MA_DICT]
    djia = pd.concat([djia, results], axis=1)

    print(djia.tail())
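As a side note, the worker function can also be written with the instance-method form of rolling, which reads a little more naturally; this is just an equivalent formulation, not part of the original answer:
def moving_average(ma):
    # equivalent to pd.Series.rolling(djia['Adj Close'], window=ma).mean()
    return djia['Adj Close'].rolling(window=ma).mean()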

Huge quantity of Dask computations causing memory issues

I am working on a task where I need to determine whether two geospatial points are within 250 meters of each other and occur within 20 minutes of each other. My data set is approximately 1.2M rows and 10 columns. So, I need to determine a distance, a time difference, and whether they meet my criteria by going through 1.2M**2 calculations.
I have been able to run the code below where I create 10,000 Dask objects to compute without problem. However, when I attempt to test 100,000 objects Dask runs up against memory limitations and I see significant CPU usage for swap. To be clear, I'm running this on a 32 core node with 125 GB of memory.
Admittedly, I'm quite new to Dask, so I'd like to know: is there a better way to solve this problem than processing in 10,000 row chunks?
#!/usr/bin/env python

import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask.array import sqrt
import time
import multiprocessing as mp

df = pd.read_hdf(...)   # Used to select single item for comparison
ddf = dd.read_hdf(...)  # Used for Dask operations

def distCheck(item, df=ddf):
    '''
    Determine if any records in df are within 250m of item and within 20
    minutes of item. Return Dask object for calculation.
    '''
    dist = sqrt(((ddf.LCC_x1-item.LCC_x1)**2+(ddf.LCC_y1-item.LCC_y1)**2))
    distcrit = dist[dist < 250]
    delta = (ddf.Date - item.Date).abs()
    timecrit = delta[delta < np.timedelta64(20,'m')]
    res1 = ddf.copy()
    res1['dist'] = dist
    res1['delta'] = delta
    res1 = res1.loc[(distcrit.index) & (timecrit.index) & (idcrit.index)]
    res1['MatchMMSI'] = item.MMSI
    res1['MatchVoy'] = item.Voyage
    out = res1
    return out

def getDaskCalls(start, stop):
    '''
    Get Dask objects to assess temporal and spatial proximity for df
    indices from start to stop.
    '''
    # Kick off multiprocessing pool, submit, and close
    pool = mp.Pool(processes=32)
    daskers = []
    for i in range(start, stop):
        result = pool.apply_async(distCheck, args=(df.iloc[i,:], ddf,))
        daskers.append(result)
    dasky = [i.get() for i in daskers]
    pool.close()
    return dasky

def runDask(calls):
    result = pd.DataFrame([], columns=calls[0].columns)
    output = dd.compute(calls)
    result = pd.concat([result] + [i for i in output[0] if i.shape[0] != 0])
    return result

###
### Process
###

# Get initial timestamp
start = time.time()

# Create Dask Calls & determine duration
dcalls = getDaskCalls(0, 10000)
callsCreated = time.time()

# Print time required to create calls
print("Dask Calls Created.")
print(callsCreated - start)

# Compute the calls with Dask
print("Computing...")
result = runDask(dcalls)

# Print the time for computation
computation = time.time()
print(" ...Done.")
print(computation - callsCreated)

pandas multiprocessing apply

I'm trying to use multiprocessing with a pandas dataframe, that is, split the dataframe into 8 parts and apply some function to each part using apply (with each part processed in a different process).
EDIT:
Here's the solution I finally found:
import multiprocessing as mp
import numpy as np
import pandas as pd
import pandas.util.testing as pdt

def process_apply(x):
    # do some stuff to data here
    pass

def process(df):
    res = df.apply(process_apply, axis=1)
    return res

if __name__ == '__main__':
    p = mp.Pool(processes=8)
    split_dfs = np.array_split(big_df, 8)
    pool_results = p.map(process, split_dfs)
    p.close()
    p.join()

    # merging parts processed by different processes
    parts = pd.concat(pool_results, axis=0)

    # merging newly calculated parts to big_df
    big_df = pd.concat([big_df, parts], axis=1)

    # checking if the dfs were merged correctly
    pdt.assert_series_equal(parts['id'], big_df['id'])
You can use https://github.com/nalepae/pandarallel, as in the following example:
from pandarallel import pandarallel
from math import sin

pandarallel.initialize()

def func(x):
    return sin(x**2)

df.parallel_apply(func, axis=1)
A more generic version, based on the author's solution, that allows running it with any function and dataframe:
from multiprocessing import Pool
from functools import partial
import numpy as np
import pandas as pd

def parallelize(data, func, num_of_processes=8):
    data_split = np.array_split(data, num_of_processes)
    pool = Pool(num_of_processes)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

def run_on_subset(func, data_subset):
    return data_subset.apply(func, axis=1)

def parallelize_on_rows(data, func, num_of_processes=8):
    return parallelize(data, partial(run_on_subset, func), num_of_processes)
So the following line:
df.apply(some_func, axis=1)
Will become:
parallelize_on_rows(df, some_func)
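A minimal usage sketch of the helpers above; squared_sum and the random dataframe are hypothetical examples, not part of the original answer:
import numpy as np
import pandas as pd

def squared_sum(row):
    # toy row-wise function: sum of squared column values
    return (row ** 2).sum()

if __name__ == '__main__':
    df = pd.DataFrame(np.random.rand(1000, 4), columns=list('abcd'))
    result = parallelize_on_rows(df, squared_sum)
    print(result.head())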
This is some code that I found useful. It automatically splits the dataframe across however many CPU cores you have.
import pandas as pd
import numpy as np
import multiprocessing as mp

def parallelize_dataframe(df, func):
    num_processes = mp.cpu_count()
    df_split = np.array_split(df, num_processes)
    with mp.Pool(num_processes) as p:
        df = pd.concat(p.map(func, df_split))
    return df

def parallelize_function(df):
    df[column_output] = df[column_input].apply(example_function)
    return df

def example_function(x):
    x = x*2
    return x
To run:
df_output = parallelize_dataframe(df, parallelize_function)
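A minimal end-to-end sketch; the column names and data here are hypothetical, since the answer leaves column_input/column_output undefined:
import numpy as np
import pandas as pd

# hypothetical column names for the example above
column_input = 'value'
column_output = 'value_doubled'

if __name__ == '__main__':
    df = pd.DataFrame({'value': np.arange(1000)})
    df_output = parallelize_dataframe(df, parallelize_function)
    print(df_output.head())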
This worked well for me:
rows_iter = (row for _, row in df.iterrows())

with multiprocessing.Pool() as pool:
    df['new_column'] = pool.map(process_apply, rows_iter)
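For this to work, process_apply needs to be a top-level function that takes a single row (a Series) and returns one value per row. A hypothetical sketch, with example columns I made up:
import multiprocessing
import pandas as pd

# hypothetical row-wise function; it must live at module top level so the
# pool can pickle a reference to it
def process_apply(row):
    return row['A'] + row['B']

if __name__ == '__main__':
    df = pd.DataFrame({'A': range(5), 'B': range(5)})
    rows_iter = (row for _, row in df.iterrows())
    with multiprocessing.Pool() as pool:
        df['new_column'] = pool.map(process_apply, rows_iter)
    print(df)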
Since I don't have much of your data script, this is a guess, but I'd suggest using p.map instead of apply_async with the callback.
p = mp.Pool(8)
pool_results = p.map(process, np.array_split(big_df, 8))
p.close()
p.join()

results = []
for result in pool_results:
    results.extend(result)
To use all (physical or logical) cores, you could try mapply as an alternative to swifter and pandarallel.
You can set the number of cores (and the chunking behaviour) upon init:
import pandas as pd
import mapply

mapply.init(n_workers=-1)

def process_apply(x):
    # do some stuff to data here
    pass

def process(df):
    # spawns a pathos.multiprocessing.ProcessPool if sensible
    res = df.mapply(process_apply, axis=1)
    return res
By default (n_workers=-1), the package uses all physical CPUs available on the system. If your system uses hyper-threading (usually twice the number of physical CPUs would show up), mapply will spawn one extra worker to prioritise the multiprocessing pool over other processes on the system.
You could also use all logical cores instead (beware that this way the CPU-bound processes will be fighting over physical CPUs, which might slow down your operation):
import multiprocessing
n_workers = multiprocessing.cpu_count()
# or more explicit
import psutil
n_workers = psutil.cpu_count(logical=True)
I also ran into the same problem when using multiprocessing Pool.map() to apply a function to different chunks of a large dataframe.
I just want to add several points in case other people run into the same problem as I did.
remember to add if __name__ == '__main__':
execute the code from a .py file; if you use an IPython/Jupyter notebook, you may not be able to run multiprocessing at all (this was true in my case, though I have no clue why)
Install Pyxtension, which simplifies using parallel map, and use it like this:
from pyxtension.streams import stream
big_df = pd.concat(stream(np.array_split(df, multiprocessing.cpu_count())).mpmap(process))
I ended up using concurrent.futures.ProcessPoolExecutor.map in place of multiprocessing.Pool.map, which took 316 microseconds for some code that took 12 seconds in serial.
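A minimal sketch of that swap; chunk_apply and the example data are hypothetical stand-ins, not the original code:
from concurrent.futures import ProcessPoolExecutor
import numpy as np
import pandas as pd

def chunk_apply(chunk):
    # stand-in for whatever per-chunk work you need
    return chunk.apply(lambda row: row.sum(), axis=1)

if __name__ == '__main__':
    df = pd.DataFrame(np.random.rand(10000, 4), columns=list('ABCD'))
    chunks = np.array_split(df, 8)
    with ProcessPoolExecutor(max_workers=8) as executor:
        result = pd.concat(executor.map(chunk_apply, chunks))
    print(result.head())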
Python's Pool.starmap() method can be used to succinctly parallelize apply use cases where column values are passed as arguments, i.e. cases like:
df.apply(lambda row: my_func(row["col_1"], row["col_2"], ...), axis=1)
Full example and benchmarking:
import time
from multiprocessing import Pool

import numpy as np
import pandas as pd

def mul(a, b, c):
    # For illustration, could obviously be vectorized
    return a * b * c

df = pd.DataFrame(np.random.randint(0, 100, size=(10_000_000, 3)), columns=list('ABC'))

# Standard apply
start = time.time()
df["mul"] = df.apply(lambda row: mul(row["A"], row["B"], row["C"]), axis=1)
print(f"Standard apply took {time.time() - start:.0f} seconds.")

# Starmap apply
start = time.time()
with Pool(10) as pool:
    df["mul_pool"] = pool.starmap(mul, zip(df["A"], df["B"], df["C"]))
print(f"Starmap apply took {time.time() - start:.0f} seconds.")

pd.testing.assert_series_equal(df["mul"], df["mul_pool"], check_names=False)
>>> Standard apply took 72 seconds.
>>> Starmap apply took 5 seconds.
This has the benefit of not relying on external libraries, plus being very readable.
Tom Raz's answer https://stackoverflow.com/a/53135031/11847090 misses an edge case where there are fewer rows in the dataframe than processes.
Use this parallelize method instead:
def parallelize(data, func, num_of_processes=8):
    # check if the number of rows is less than the number of processes
    # to avoid the following error:
    # ValueError: Expected a 1D array, got an array with shape
    num_rows = len(data)
    if num_rows == 0:
        return None
    elif num_rows < num_of_processes:
        num_of_processes = num_rows
    data_split = np.array_split(data, num_of_processes)
    pool = Pool(num_of_processes)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data
I also used a dask bag to multithread this instead of this custom code.
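A hedged sketch of what that dask.bag variant could look like; the chunking mirrors the parallelize() helper above and is my assumption, not the author's exact code:
import dask.bag as db
import numpy as np
import pandas as pd

def parallelize_with_bag(data, func, num_partitions=8):
    # split the dataframe into chunks, map func over them with threads,
    # then stitch the results back together
    chunks = np.array_split(data, num_partitions)
    bag = db.from_sequence(chunks, npartitions=num_partitions)
    results = bag.map(func).compute(scheduler='threads')
    return pd.concat(results)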
