Multiprocessing with large iterable - python

I have two large Pandas Dataframes (1GB+) with data that needs to be processed by multiple workers.
I'm able to perform the operations without issues in a toy example with much smaller Dataframes (DFs).
Below is my reproducible example.
I've tried several routes:
I am unable to take advantage of chunking: the DFs need to be sliced into specific pieces on an index before each piece is fed to the workers, and chunking can only split them at arbitrary lengths.
Using starmap: This is what you see in the code below. I pre-slice the DFs on the indexes and store the pieces in an iterable, so the pieces can be passed as small frames (or dicts) to the worker processes. This is not feasible at the size of my real DFs: the iterable never finishes being created. I've tried and failed to use a generator/yield with starmap; I would appreciate an example of a workaround if this is an option.
Using imap: The entire input DFs end up going to each of the worker processes. I was able to use generators/yield through an intermediate function that sliced the DFs and made them available to each worker without building a huge iterable, but the process took longer than a plain for loop would have. The overhead of passing the data to the workers was the bottleneck.
I am ready to conclude that multiprocessing cannot be applied to a large data table that needs to be sliced before being sent to workers.
import random
import numpy as np
import pandas as pd
import multiprocessing

def df_slicer(df1, df2):
    while len(df1) > 0:
        value_to_slice_df_by = df1.iloc[0]['Number']  # Will use first found number as the index for this process.
        a = df1[df1['Number'] == value_to_slice_df_by].copy()
        b = df2[df2['Number'] == value_to_slice_df_by].copy()
        print('len(df1): {}, len(df2): {}'.format(len(a), len(b)))
        return [a, b]

if __name__ == '__main__':
    # These are large and will be pulled in from Pandas pickles.
    df1 = pd.DataFrame(np.random.randint(3000, size=(100000, 2)), columns=['Number', 'Info_df1'])
    df2 = pd.DataFrame(np.random.randint(5000, size=(100000, 2)), columns=['Number', 'Info_df2'])

    iterable = [[df1.loc[df1['Number'] == i], df2.loc[df2['Number'] == i]] for i in list(np.unique(df1['Number']))]

    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1)
    for res in pool.starmap(df_slicer, [[i[0], i[1]] for i in iterable]):
        result = res
    print('Done')
    pool.close(); pool.join()
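For what it's worth, a minimal sketch of the generator-plus-imap route described above, assuming the slices are defined by the 'Number' column: groupby produces the slices lazily, so the full iterable is never materialized, though the serialization overhead noted above may still dominate.

import multiprocessing
import numpy as np
import pandas as pd

def process_pair(pair):
    # Worker: receives one (df1 slice, df2 slice) pair.
    a, b = pair
    print('len(df1): {}, len(df2): {}'.format(len(a), len(b)))
    return [a, b]

def slice_generator(df1, df2):
    # Yield one pair of slices per unique 'Number' without building the full list up front.
    for value, a in df1.groupby('Number'):
        yield a, df2[df2['Number'] == value]

if __name__ == '__main__':
    df1 = pd.DataFrame(np.random.randint(3000, size=(100000, 2)), columns=['Number', 'Info_df1'])
    df2 = pd.DataFrame(np.random.randint(5000, size=(100000, 2)), columns=['Number', 'Info_df2'])
    with multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1) as pool:
        # imap consumes the generator lazily; chunksize batches the pickling to cut IPC overhead.
        for result in pool.imap(process_pair, slice_generator(df1, df2), chunksize=16):
            pass
    print('Done')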

Related

write multi dimensional numpy array to many files

I was wondering if there was a more efficient way of doing the following without using loops.
I have a numpy array with the shape (i, x, y, z). Essentially I have i elements of the shape (x, y, z).
I want to write each element to a separate file so that I have i files, each with the data from a single element.
In my case, each element is an image, but I'm sure a solution can be format agnostic.
I'm currently looping through each of the i elements and writing them out one at a time.
As i gets really large, this takes a progressively longer time. Is there a better way or a useful library which could make this more efficient?
Update
I tried the suggestion to use multiprocessing via concurrent.futures, first with the thread pool and then with the process pool. The code was simpler, but the time to complete was 4x slower.
i in this case is approximately 10000, while x and y are approximately 750.
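For reference, a rough sketch of what the concurrent.futures process-pool attempt might look like, assuming each element is written with np.save under an index-based name (the array here is a small stand-in); the per-element pickling this implies is a plausible reason it came out slower:

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def write_element(args):
    # Each task gets (index, element) so the output filename stays tied to the original position.
    index, element = args
    np.save(f'element_{index:05d}.npy', element)

if __name__ == '__main__':
    data = np.ones((100, 750, 750, 3), dtype=np.uint8)  # stand-in for the real (i, x, y, z) array
    with ProcessPoolExecutor() as executor:
        # chunksize batches the pickling of elements sent to the worker processes.
        list(executor.map(write_element, enumerate(data), chunksize=16))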
This sounds very suitable for multiprocessing, as the different elements need to be processed separately and can be saved to disk independently.
Python has a useful package for this, called multiprocessing, with a variety of pooling, processing, and other options.
Here's a simple (and comment-documented) example of usage:
from multiprocessing import Process
import numpy as np

# This should be your existing function
def write_file(element):
    # write file
    pass

# You'll still be looping of course, but in parallel over batches.
# This is a helper function for looping over a "batch".
def write_list_of_files(elements_list):
    for element in elements_list:
        write_file(element)

# Your data goes here...
all_elements = np.ones((1000, 256, 256, 3))
num_procs = 10  # Depends on system limitations, number of cpu-cores, etc.

# Each of these processes in the list is going to run the "write_list_of_files" function,
# but with separate inputs, thanks to the indexing trick of using "k::num_procs"...
procs = [Process(target=write_list_of_files, args=[all_elements[k::num_procs, ...]])
         for k in range(num_procs)]

for p in procs:
    p.start()  # Each process starts running independently

for p in procs:
    p.join()  # assures the code won't continue until all are "joined" and done. Optional obviously...

print('All done!')  # This only runs once all procs are done, due to "p.join"

Python Multiprocessing: Writing to file every k iterations

I am using the multiprocessing module in python 3.7 to call a function repeatedly in parallel. I would like to write the results out to a file every k iterations. (It can be a different file each time.)
Below is my first attempt, which basically loops over sets of function arguments, running each set in parallel and writing the results to a file before moving onto the next set. This is obviously very inefficient. In practice, the time it takes for my function to run is much longer and varies depending on the input values, so many processors sit idle between iterations of the loop.
Is there a more efficient way to achieve this?
import multiprocessing as mp
import numpy as np
import pandas as pd

def myfunction(x): # toy example function
    return(x**2)

for start in np.arange(0, 500, 100):
    with mp.Pool(mp.cpu_count()) as pool:
        out = pool.map(myfunction, np.arange(start, start+100))
        pd.DataFrame(out).to_csv('filename_' + str(start//100+1) + '.csv', header=False, index=False)
My first comment is that if myfunction is as trivial as the one you have shown, then your performance will be worse with multiprocessing, because there is overhead in creating the process pool (which, by the way, you are unnecessarily recreating on every loop iteration) and in passing arguments from one process to another.
Assuming that myfunction is pure CPU, then after map has returned 100 values there is an opportunity to overlap the writing of the CSV files that you are not taking advantage of (it's not clear how much performance will be gained by concurrent disk writing; it depends on the type of drive you have, head movement, etc.). So a combination of multithreading and multiprocessing could be the solution.
The number of processes in your processing pool should be limited to the number of CPU cores, given the assumption that myfunction is 100% CPU, does not release the Global Interpreter Lock, and therefore cannot take advantage of a pool size greater than the number of CPUs you have. That is my assumption, anyway; if you are going to be using certain numpy functions, for example, then it is an erroneous assumption. On the other hand, numpy is known to use multiprocessing for some of its own processing, in which case combining numpy with your own multiprocessing could result in worse performance.
Your current code only uses numpy to generate ranges, which seems like overkill, since there are other ways to generate ranges. I have taken the liberty of generating the ranges in a slightly different fashion by defining RANGE_START and RANGE_STOP values and N_SPLITS, the number of equal (or as equal as possible) divisions of this range, and generating tuples of start and stop values that can be converted into ranges. I hope this is not too confusing, but it seemed like a more flexible approach.
In the following code both a thread pool and a processing pool are created. The tasks are submitted to the thread pool with one of the arguments being the processing pool, which is used by the worker to do the CPU-intensive calculations; when the results have been assembled, the worker writes out the CSV file.
from multiprocessing.pool import Pool, ThreadPool
from multiprocessing import cpu_count
import pandas as pd

def worker(process_pool, index, split_range):
    out = process_pool.map(myfunction, range(*split_range))
    pd.DataFrame(out).to_csv(f'filename_{index}.csv', header=False, index=False)

def myfunction(x): # toy example function
    return(x ** 2)

def split(start, stop, n):
    # split [start, stop) into n ranges of (nearly) equal length
    k, m = divmod(stop - start, n)
    return [(start + i * k + min(i, m), start + (i + 1) * k + min(i + 1, m)) for i in range(n)]

def main():
    RANGE_START = 0
    RANGE_STOP = 500
    N_SPLITS = 5
    n_processes = min(N_SPLITS, cpu_count())
    split_ranges = split(RANGE_START, RANGE_STOP, N_SPLITS) # [(0, 100), (100, 200), ... (400, 500)]
    process_pool = Pool(n_processes)
    thread_pool = ThreadPool(N_SPLITS)
    for index, split_range in enumerate(split_ranges):
        thread_pool.apply_async(worker, args=(process_pool, index, split_range))
    # wait for all threading tasks to complete:
    thread_pool.close()
    thread_pool.join()

# required for Windows:
if __name__ == '__main__':
    main()
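A simpler variant of the same idea, sketched under the same toy assumptions rather than taken from the answer above, keeps one long-lived process pool, streams results back with imap, and writes a CSV every K results; the workers keep computing while the main process writes:

from itertools import islice
from multiprocessing import Pool, cpu_count
import numpy as np
import pandas as pd

def myfunction(x):  # toy example function
    return x ** 2

if __name__ == '__main__':
    K = 100  # write a file every K results
    with Pool(cpu_count()) as pool:
        # imap streams results in submission order; chunksize reduces IPC overhead.
        results = pool.imap(myfunction, np.arange(0, 500), chunksize=10)
        index = 1
        while True:
            block = list(islice(results, K))
            if not block:
                break
            pd.DataFrame(block).to_csv(f'filename_{index}.csv', header=False, index=False)
            index += 1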

Dask - Is it possible to use all threads in every worker with custom function?

In my case I have several files in S3 and a custom function that reads each one of them and processes it using all threads. To simplify the example I just generate a dataframe df and assume that my function is tsfresh.extract_features, which uses multiprocessing.
Generate Data
import pandas as pd
from tsfresh import extract_features
from tsfresh.examples.robot_execution_failures import download_robot_execution_failures, \
    load_robot_execution_failures

download_robot_execution_failures()
ts, y = load_robot_execution_failures()

df = []
for i in range(5):
    tts = ts.copy()
    tts["id"] += 88 * i
    df.append(tts)
df = pd.concat(df, ignore_index=True)
Function
def fun(df, n_jobs):
    extracted_features = extract_features(df,
                                          column_id="id",
                                          column_sort="time",
                                          n_jobs=n_jobs)
Cluster
import dask
from dask.distributed import Client, progress
from dask import compute, delayed
from dask_cloudprovider import FargateCluster

my_vpc = # your vpc
my_subnets = # your subnets
cpu = 2
ram = 4

cluster = FargateCluster(n_workers=1,
                         image='rpanai/feats-worker:2020-08-24',
                         vpc=my_vpc,
                         subnets=my_subnets,
                         worker_cpu=int(cpu * 1024),
                         worker_mem=int(ram * 1024),
                         cloudwatch_logs_group="my_log_group",
                         task_role_policies=['arn:aws:iam::aws:policy/AmazonS3FullAccess'],
                         scheduler_timeout='20 minutes'
                         )
cluster.adapt(minimum=1, maximum=4)
client = Client(cluster)
client
Using all worker threads (FAIL)
to_process = [delayed(fun)(df, cpu) for i in range(10)]
out = compute(to_process)
AssertionError: daemonic processes are not allowed to have children
Using only one thread (OK)
In this case it works fine but I'm wasting resources.
to_process = [delayed(fun)(df, 0) for i in range(10)]
out = compute(to_process)
Question
I know that for this particular function I could eventually write a custom distributor using multithreading and a few other tricks, but I'd like to distribute a job where every worker can take advantage of all its resources without my having to worry too much.
Update
The function was just an example; the real one does some cleaning before the actual feature extraction and afterwards saves the result to S3.
def fun(filename, bucket_name, filename_out, n_jobs):
    df = pd.read_parquet(f"s3://{bucket_name}/{filename}")
    # do some cleaning
    extracted_features = extract_features(df,
                                          column_id="id",
                                          column_sort="time",
                                          n_jobs=n_jobs)
    extracted_features.to_parquet(f"s3://{bucket_name}/{filename_out}")
I can help answer your specific question for tsfresh, but if tsfresh was just a toy example, this might not be what you want.
For tsfresh, you would typically not mix tsfresh's multiprocessing with dask, but let dask do all the handling. This means you start with a single dask.DataFrame (in your test case, you could just convert the pandas dataframe into a dask one; for your real use case you can read directly from S3, as described in the dask documentation), and then distribute the feature extraction over the dask dataframe. The nice thing about the feature extraction is that it works independently on every time series, so we can generate a single job for every time series.
The current version of tsfresh (0.16.0) has a small helper function that will do this for you: see here.
In the next version, it might even be possible to just run extract_features on the dask dataframe directly.
I am not sure if this helps to solve your more general question. In my opinion, in most cases you do not want to mix dask's distribution and "local" multicore computation, but should just let dask handle everything, because on a dask cluster you might not even know how many cores you will have on each machine (or you might get only a single one per job).
This means that if your job can be distributed N times and each of them would start M sub-jobs, you just give N x M jobs to dask and let it figure out the rest (including data locality).
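As a rough illustration of the "one job per time series" idea (a sketch only, not the helper function mentioned above), you could group the frame by id and hand each group to dask as its own delayed task, with tsfresh's own parallelism switched off via n_jobs=0:

import dask
from dask import delayed
from tsfresh import extract_features

def extract_single_id(sub_df):
    # n_jobs=0 disables tsfresh's own multiprocessing; dask supplies the parallelism.
    return extract_features(sub_df, column_id="id", column_sort="time", n_jobs=0)

# df is the concatenated robot-execution-failures frame built above
tasks = [delayed(extract_single_id)(sub_df) for _, sub_df in df.groupby("id")]
results = dask.compute(*tasks)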

Python - Running function concurrently (multiple instance)

I built a little function that gathers some data using a 3rd-party API. Call it MyFunc(Symbol, Field); it returns some info based on the symbol given.
The idea was to fill a Pandas df with the returned value using something like:
df['MyNewField'] = df.apply(lambda x: MyFunc(x, 'FieldName'))
All this works, but each query takes around 100 ms to run. This seems fast until you realize you may have 30,000 or more to do (3,000 symbols with 10 fields each, for starters).
I was wondering if there would be a way to run this concurrently, as each request is independent? I am not looking for multiprocessing libraries and the like, but instead a way to do multiple queries to the 3rd party at the same time to reduce the time taken to gather all the data. (Also, I suppose this will change the initial structure used to store all the received data: I do not mind skipping apply and my dataframe at first, and instead saving the data as it is received in a text or dictionary-type structure.)
NOTE: While I wish I could change MyFunc to request multiple symbols/fields at once, this cannot be done in all cases (meaning some fields do not allow that, and a single request is the only way to go). This is why I am looking at concurrent execution and not at changing MyFunc.
Thanks!
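A minimal sketch of the kind of concurrent fan-out described above, using a thread pool since the calls are I/O-bound; the MyFunc stub and the symbol/field lists here are placeholders, not the real API:

from concurrent.futures import ThreadPoolExecutor
from functools import partial
import pandas as pd

def MyFunc(symbol, field):
    # stand-in for the real 3rd-party call, which takes ~100 ms each
    return f'{symbol}:{field}'

symbols = ['AAPL', 'MSFT', 'GOOG']   # assumption: the list of symbols
fields = ['FieldA', 'FieldB']        # assumption: the fields to request
df = pd.DataFrame(index=symbols)

# Threads overlap the waiting on the API; tune max_workers to what the API tolerates.
with ThreadPoolExecutor(max_workers=32) as executor:
    for field in fields:
        df[field] = list(executor.map(partial(MyFunc, field=field), symbols))

print(df)

Because the real work is waiting on the API, threads sidestep the GIL concern and avoid pickling the dataframe between processes.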
There are many libraries for parallelizing over a pandas dataframe. However, I prefer the native multiprocessing Pool to do the same. I also use tqdm along with it to track progress.
import numpy as np
import pandas as pd
from multiprocessing import cpu_count, Pool

cores = 4           # Number of CPU cores on your system
partitions = cores  # Define as many partitions as you want

def partition(data, num_partitions):
    partition_len = int(len(data) / num_partitions)
    partitions = []
    num_rows = 0
    for i in range(num_partitions - 1):
        partition = data.iloc[i * partition_len:i * partition_len + partition_len]
        num_rows = num_rows + partition_len
        partitions.append(partition)
    partitions.append(data.iloc[num_rows:len(data)])
    return partitions

def parallelize(data, func):
    data_split = partition(data, partitions)
    pool = Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

df['MyNewField'] = parallelize(df['FieldName'], MyFunc)

python or dask parallel generator?

Is it possible in python (maybe using dask, maybe using multiprocessing) to 'emplace' generators on cores, and then, in parallel, step through the generators and process the results?
It needs to be generators in particular (or objects with __iter__); lists of all the elements the generators yield won't fit into memory.
In particular:
With pandas, I can call read_csv(...iterator=True), which gives me an iterator (TextFileReader) - I can iterate over it with for or explicitly call next multiple times. The entire csv never gets read into memory. Nice.
Every time I read a next chunk from the iterator, I also perform some expensive computation on it.
But now I have 2 such files. I would like to create 2 such generators, and 'emplace' 1 on one core and 1 on another, such that I can:
result = expensive_process(next(iterator))
on each core, in parallel, and then combine and return the result. Repeat this step until one or both generators are exhausted.
It looks like the TextFileReader is not pickleable, nor is a generator. I can't find out how to do this in dask or multiprocessing. Is there a pattern for this?
Dask's read_csv is designed to load data from multiple files in chunks, with a chunk-size that you can specify. When you operate on the resultant dataframe, you will be working chunk-wise, which is exactly the point of using Dask in the first place. There should be no need to use your iterator method.
The dask dataframe method you will want to use, most likely, is map_partitions().
If you really wanted to use the iterator idea, you should look into dask.delayed, which is able to parallelise arbitrary python functions, by sending each invocation of the function (with a different file-name for each) to your workers.
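A minimal sketch of that approach, where the file pattern and expensive_process are placeholders standing in for your two CSVs and your per-chunk computation:

import dask.dataframe as dd

def expensive_process(chunk):
    # placeholder for the per-chunk computation; receives a pandas DataFrame partition
    return chunk.sum()

# blocksize controls the partition size, much like the chunks from read_csv(iterator=True)
ddf = dd.read_csv('data_*.csv', blocksize='64MB')
result = ddf.map_partitions(expensive_process).compute()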
So luckily I think this problem maps nicely onto Python's multiprocessing.Process and multiprocessing.Queue.
from multiprocessing import Process, Queue

def data_generator(whatever):
    for v in something(whatever):
        yield v

def generator_constructor(whatever):
    def generator(outputQueue):
        for d in data_generator(whatever):
            outputQueue.put(d)
        outputQueue.put(None)  # sentinel
    return generator

def procSumGenerator():
    outputQs = [Queue(size) for _ in range(NumCores)]
    procs = [Process(target=generator_constructor(whatever),
                     args=(outputQs[i],))
             for i in range(NumCores)]
    for proc in procs: proc.start()

    # until any output queue returns a None, collect
    # from all and yield
    done = False
    while not done:
        results = [oq.get() for oq in outputQs]
        done = any(res is None for res in results)
        if not done:
            yield some_combination_of(results)

    for proc in procs: proc.terminate()

for v in procSumGenerator():
    print(v)
Maybe this can be done better with Dask? I find that my solution fairly quickly saturates the network for large sizes of generated data - I'm manipulating csvs with pandas and returning large numpy arrays.
https://github.com/colinator/doodle_generator/blob/master/data_generator_uniform_final.ipynb
