Using Multiprocessing with Dataframes - python

I have a function that has 4 nestled for loops in it. The function takes in a dataframe and returns a new dataframe. Currently the function takes about 2 hours to run, I need it to run in around 30 mins...
I've tried multiprocessing using 4 cores but I cant seem to get it to work. I start by creating a list of my input dataframe split into smaller chunks (list_of_df)
all_trips = uncov_df.TRIP_NO.unique()
list_of_df = []
for trip in all_trips:
list_of_df.append(uncov_df[uncov_df.TRIP_NO==trip])
I then tried mapping this list of chunks into my function (transform_df) using 4 pools.
from multiprocessing import Pool
if __name__ == "__main__":
with Pool(4) as p:
df_uncov = list(p.map(transform_df, list_of_df))
df = pd.concat(df_uncov)
When I run the above my code cell freezes and nothing happens. Does anyone know what's going on?

This is how I set mine up using starmap. This returns a list of dfs to be concatenated later.
#put this above if __name__ == "__main__":
def get_dflist_multiprocess(keys_list, num_proc=4):
with Pool(num_proc) as p:
df_list = p.starmap(transform_df, list_of_df)
p.close()
p.join()
return df_list
#then below if __name__ == "__main__":
df_list = get_dflist_multiprocess(list_of_df, num_proc=4) #collect dataframes for each file
df_new = pd.concat(df_list, sort=False)

Related

Why append list is slower when using multiprocess?

I want to append list. Each element to be append is a large dataframe.
I try to use Multiprocessing mudule to speed up appending list. My code as follows:
import pandas as pd
import numpy as np
import time
import multiprocessing
from multiprocessing import Manager
def generate_df(size):
df = pd.DataFrame()
for x in list('abcdefghi'):
df[x] = np.random.normal(size=size)
return df
def do_something(df_list,size,k):
df = generate_df(size)
df_list[k] = df
if __name__ == '__main__':
size = 200000
num_df = 30
start = time.perf_counter()
with Manager() as manager:
df_list = manager.list(range(num_df))
processes = []
for k in range(num_df):
p = multiprocessing.Process(target=do_something, args=(df_list,size,k,))
p.start()
processes.append(p)
for process in processes:
process.join()
final_df = pd.concat(df_list)
print(final_df.head())
finish = time.perf_counter()
print(f'Finished in {round(finish-start,2)} second(s)')
print(len(final_df))
The elapsed time is 7 secs.
I try to append list without Multiprocessing.
df_list = []
for _ in range(num_df):
df_list.append(generate_df(size))
final_df = pd.concat(df_list)
But, this time the elapsed time is 2 secs! Why append list with multiprocessing is slower than without that?
When you use manager.list, you're not using a normal Python list. You're using a special list proxy object that has a whole lot of other stuff going on. Every operation on that list will involve locking and interprocess communication so that every process with access to the list will see the same data in it at all times. It's slow because it's a non-trivial problem to keep everything consistent in that way.
You probably don't need all of that synchronization, it's just slowing you down. A much more natural way to do what you're attempting is to use a process pool and it's map method. The pool will handle creating and shutting down the processes, and map will call a target function with an argument from an iterable.
Try something like this, which will use a number of worker processes equal to the number of CPUs your system has:
if __name__ == '__main__':
size = 200000
num_df = 30
start = time.perf_counter()
with multiprocessing.pool() as pool:
df_list = pool.map(generate_df, [size]*num_df)
final_df = pd.concat(df_list)
print(final_df.head())
finish = time.perf_counter()
print(f'Finished in {round(finish-start,2)} second(s)')
print(len(final_df))
This will still have some overhead, since the interprocess communication used to pass the dataframes back to the main process is not free. It may still be slower than running everything in a single process.
Two points:
Start and retrieve data from subprocess costs data must be transported between processes. This means that if transportation time is more than the time it takes to compute data you don't find benefits. This article can explain better the question.
In your implementation the bottleneck is in the df_list use. The Manager uses lock, this means that the processes are not free to write results into the list df_list

How to increase a function speed being calling 400 time in Python

I have a list named dfs. It contains 400 Pandas dataframes of size 700 rows x 400 columns.
I have a function like this:
def updateDataframe(i):
global dfs
df = dfs[i]
df["abc"].iloc[-1] = "xyz"
df["abc2"] = df["abc"].rolling(10).mean()
........ #More pandas operations like this
dfs[i] = df
for i in range(len(dfs)):
updateDataframe(i)
Now, this loop takes 10 seconds to execute. I have tried python multi-processing, but it takes same time and somtimes even more.
Things I tried:
import multiprocessing.dummy as mp #Multi process Library, used for speeding up download
p=mp.Pool(8) #Define Number of Process to Use
p.map(updateDataframe,range(len(dfs))) # Call the Download Image funciton
p.close() #Close the multi threads
p.join()
Also tried this:
from multiprocessing import Process
if __name__ == "__main__": # confirms that the code is under main function
processes = []
for i in range(len(dfs)):
process = Process(target=updateDataframe, args=(i,))
processes.append(process)
processes.start()
# complete the processes
for i in range(len(processes)):
processes[i].join()

Making an apply function faster in Python

I am running the following code on about 6 million rows. It's so slow and never ends.
df['City'] = df['POSTAL_CODE'].apply(lambda x: nomi.query_postal_code(x).county_name)
It assigns a corresponding city to each postal code. When I run it on a slice of dateset(e.g, 1000 rows) it works well. But running the code on the whole data never gives me any output.
Can anyone modify the code to make it faster?
Thank you!
!pip3 install multiprocess
from multiprocess import Pool
def parallelize_dataframe(data, func, n_cores=4):
data_split = np.array_split(data, n_cores)
pool = Pool(n_cores)
data = pd.concat(pool.map(func, data_split))
pool.close()
pool.join()
return data
df['City'] = parallelize_dataframe(df['POSTAL_CODE'], lambda x: nomi.query_postal_code(x).county_name, 4)

Multiprocessing Doesn't Create Any Extra Processes

I am trying to increase the speed of my program in Python using multiprocessing, but it doesn't actually create any more processes. I've watched a few tutorials but I'm not getting anywhere.
Here it is:
cpuutil = int((multiprocessing.cpu_count()) / 2)
p = Pool(processes = cpuutil)
output = p.map(OSGBtoETRSfunc(data, eastcol, northcol))
p.close()
p.join()
return output
So to me, this should create 2 processes on a quadcore machine, but it doesn't. My CPU util sits around 18%...
Any insight? It looks the same as the tutorials I have watched... The p.map was not working when listing arguments in square brackets ([]) so I presumed it would need to be in the syntax it is above?
Thanks
I don't clearly understand what do you want, so let's start from simple. The following is a way to simply call the same function over the rows of pd dataframe:
import pandas as pd
import numpy as np
import os
import pathos
from contextlib import closing
NUM_PROCESSES = os.cpu_count()
# create some data frame 100x4
nrow = 100
ncol = 4
df = pd.DataFrame(np.random.randint(0,100,size=(nrow, ncol)), columns=list('ABCD'))
# dataframe resides in global scope
# so it is accessible to processes spawned below
# I pass only row indices to each process
# function to be run over rows
# it transforms the given row independently
def foo(idx):
# extract given row to numpy
row = df.iloc[[idx]].values[0]
# you can pass ranges:
# df[2:3]
# transform row
# I return it as list for simplicity of creating dataframe
row = np.exp(row)
# return numpy row
return row
# run pool over range of indexes (0,1, ... , nrow-1)
# and close it afterwars
# there is not reason here to have more workers than number of CPUs
with closing(pathos.multiprocessing.Pool(processes=NUM_PROCESSES)) as pool:
results = pool.map(foo, range(nrow))
# create new dataframe from all those numpy slices:
col_names = df.columns.values.tolist()
df_new = pd.DataFrame(np.array(results), columns=col_names)
What in your computation needs more complicated setup?
EDIT: Ok, here is running two functions concurrently (I am not much familiar with pandas, so just switch to numpy):
# RUNNING TWO FUNCTIONS SIMLTANEOUSLY
import pandas as pd
import numpy as np
from multiprocessing import Process, Queue
# create some data frame 100x4
nrow = 100
ncol = 4
df = pd.DataFrame(np.random.randint(0,100,size=(nrow, ncol)), columns=list('ABCD'))
# dataframe resides in global scope
# so it is accessible to processes spawned below
# I pass only row indices to each process
# function to be run over part1 independently
def proc_func1(q1):
# get data from queue1
data1 = q1.get()
# I extract given data to numpy
data_numpy = data1.values
# do something
data_numpy_new = data_numpy + 1
# return numpy array to queue 1
q1.put(data_numpy_new)
return
# function to be run over part2 independently
def proc_func2(q2):
# get data from queue2
data2 = q2.get()
# I extract given data to numpy
data_numpy = data2.values
# do something
data_numpy_new = data_numpy - 1
# return numpy array to queue 2
q2.put(data_numpy_new)
return
# instantiate queues
q1 = Queue()
q2 = Queue()
# divide data frame into two parts
part1 = df[:50]
part2 = df[50:]
# send data, so it will already be in queries
q1.put(part1)
q2.put(part2)
# start two processes
p1 = Process(target=proc_func1, args=(q1,))
p2 = Process(target=proc_func2, args=(q2,))
p1.start()
p2.start()
# wait until they finish
p1.join()
p2.join()
# read results from Queues
res1 = q1.get()
res2 = q2.get()
if (res1 is None) or (res2 is None):
print('Error!')
# reassemble two results back to single dataframe (might be inefficient)
col_names = df.columns.values.tolist()
# concatenate results along x axis
df_new = pd.DataFrame(np.concatenate([np.array(res1), np.array(res2)], axis=0), columns=col_names)
In Python you should provide the function and the arguments separated. If not, you are executing the function OSGBtoETRSfunc at the time of creating the process. Instead, you should provide the pointer to the function, and a list with the arguments.
Your case is similar to the one shown on Python Docs: https://docs.python.org/3.7/library/multiprocessing.html#introduction
Anyway, I think you are using the wrong function. Pool.map() works as map: on a list of items and applies the same function to each item. I think that your function OSGBtoERTSfunc needs the three params in order to work properly. Please, instead of using p.map(), use p.apply()
cpuutil = int((multiprocessing.cpu_count()) / 2)
p = Pool(processes = cpuutil)
output = p.apply(OSGBtoETRSfunc, [data, eastcol, northcol])
p.close()
p.join()
return output

sending each looped pandas calculation to a different thread (python3.6.5) with pool.map

with a basic pandas df of financial market OHLCV data, I am trying to add numerous calculated columns to the df. The large number of columns and calculations is making this SLOW SLOW SLOW!
Trying to multiprocess with pool.map, but getting nowhere.
Ideally, each iteration of the loop should be sent to a discrete thread. Simplified moving averages in code below.
Shown simple dictionary and rolling mean works SLOWLY
TypeError: map() missing 1 required positional argument: 'iterable'
All help appreciated-thx
import pandas as pd
from multiprocessing.dummy import Pool as ThreadPool
#####################################################
# DJIA_OHLCV_test.csv has format:
# Date,Open,High,Low,Close,Adj Close,Volume
#
1/2/2015,17823.07031,17951.7793,17731.30078,17832.99023,17832.99023,76270000
#
1/3/2015,17823.07031,17951.7793,17731.30078,17832.99023,17832.99023,76270000
DJIA = pd.read_csv('DJIA_OHLCV_test.csv')
"""
#####################################################
# # This works! please comment out to switch
# MAdict = {'MA50':50, 'MA100':100, 'MA200':200} # Define Moving Average
Windows
# for MAkey in MAdict:
# DJIA[('ma' + MAkey)] = pd.Series.rolling(DJIA['Adj Close'],
window=MAdict[MAkey]).mean()
#####################################################
"""
# This doesn't work! please comment out to switch
MAdict = {'MA50':50, 'MA100':100, 'MA200':200}
pool = ThreadPool(3)
def moving_average(MAkey):
return pd.Series.rolling(DJIA['Adj Close'], window=MAdict[MAkey]).mean()
for MAkey in MAdict:
DJIA[('ma' + MAkey)] = pool.map(moving_average(MAkey))
#####################################################
print(DJIA.tail())
pool.map is a blocking call, so instead of iterating over MAdict and calling pool.map you need to pass the iterable directly as an argument to pool.map:
import pandas as pd
from multiprocessing.dummy import Pool
def moving_average(ma):
return pd.Series.rolling(djia['Adj Close'], window=ma).mean()
if __name__ == '__main__':
N_WORKERS = 3
MA_DICT = {'MA50':50, 'MA100':100, 'MA200':200}
djia = pd.read_csv('DJIA_OHLCV_test.csv')
with Pool(N_WORKERS) as pool:
results = pool.map(moving_average, iterable=MA_DICT.values())
# concatenate results and rename columns
results = pd.concat(results, axis=1)
results.columns = ['ma' + key for key in MA_DICT]
djia = pd.concat([djia, results], axis=1)
print(djia.tail())

Categories