I'm new to Python multiprocessing. I'm trying to read an enormous Excel file, which takes a long time. I just found out about multiprocessing, which lets a program run in parallel on multiple CPU cores at the same time and can speed it up significantly. I'm trying to use it, but when I read the file via Pool, it raises a "no such file or directory" error. Without Pool (just calling the function with the parameter), it doesn't raise an error.
import os
import pandas as pd
import multiprocessing as mp

def read_csv(filename):
    return pd.read_csv(filename)

def main():
    myfile = "C:\\excel_processor\\1_wht_escrow_verified_id -1.csv"
    pool = mp.Pool(2)
    mydf = pool.map(read_csv, myfile)

if __name__ == '__main__':
    main()
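One likely cause, offered as a sketch rather than a confirmed diagnosis: pool.map iterates over its second argument, and iterating over a string yields its individual characters. Each worker would then be asked to open a file with a one-character name such as "C" or ":", which would explain the "no such file or directory" error. Wrapping the path in a list avoids that. The example below uses a hypothetical helper count_lines and throwaway temporary files in place of pd.read_csv and the real path, so that it runs anywhere.

```python
import multiprocessing as mp
import os
import tempfile


def count_lines(filename):
    # Stand-in for pd.read_csv(filename): open the file and count its lines.
    with open(filename, "r") as f:
        return sum(1 for _ in f)


def make_sample_files(directory, n_files=2):
    # Create a few small CSV files so the sketch is self-contained.
    paths = []
    for i in range(n_files):
        path = os.path.join(directory, "sample_%d.csv" % i)
        with open(path, "w") as f:
            f.write("key,value\n1,a\n2,b\n")
        paths.append(path)
    return paths


def main():
    with tempfile.TemporaryDirectory() as directory:
        paths = make_sample_files(directory)

        # What map() sees when handed a bare string: one item per character.
        print(list(paths[0])[:3])

        # Pass a list of paths instead, even when there is only one file.
        with mp.Pool(2) as pool:
            print(pool.map(count_lines, paths))


if __name__ == "__main__":
    main()
```

In the question's code, the equivalent change would be pool.map(read_csv, [myfile]). With a single file there is nothing to run in parallel, so the pool only pays off once the list holds several files.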
I am using Jupyter Notebook to count the occurrences of a value in multiple CSV files. I have around 60 CSV files, each about 1 GB in size. To loop through them efficiently, I use multithreading. However, the kernel keeps dying whenever I execute the following code:
import glob
import pandas as pd
from multiprocessing.dummy import Pool as ThreadPool

files = glob.glob(path + '/*.csv')

def func(f):
    df = pd.read_csv(f)
    df = df[df['key'] == 1]
    return df['key'].value_counts()

pool = ThreadPool(4)
results = pool.map(func, files)
pool.close()
pool.join()
results
What could be the reason for this? Is there a way to fix this?
There are two issues in your code.
In Python, the Pool in multiprocessing.dummy is backed by threads, so you are actually using multithreading rather than multiprocessing. Change the import to the following if you want multiprocessing:
from multiprocessing import Pool
But as you mentioned, there is ~60 GB of data, and I'm afraid your local computer can't handle that.
I believe you need a powerful cluster for this task (no more pandas), so you may need to consider something like Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv(your_file_list, header=True)
df = df.filter(df.key == 1)
df.head(5) # you can use df.collect() if the result set is not too large
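If a cluster is not available, one alternative worth trying is to avoid holding whole files in memory. The kernel probably dies because several 1 GB files are parsed into DataFrames at the same time (an assumption; the question gives no memory figures). The sketch below streams each file row by row with the standard csv module and keeps only a running count, so memory use stays small regardless of file size. The function names count_key_matches and count_in_files are hypothetical; the column name key and the value 1 come from the question. With pandas, passing chunksize to pd.read_csv gives a similar streaming effect.

```python
import csv
from multiprocessing.dummy import Pool as ThreadPool


def count_key_matches(path, column="key", wanted="1"):
    # Stream the file one row at a time; only the counter stays in memory.
    # csv yields strings, so the comparison is against the text "1".
    count = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get(column) == wanted:
                count += 1
    return count


def count_in_files(paths, workers=4):
    # Threads mirror the question's setup; each one handles one file at a time.
    with ThreadPool(workers) as pool:
        return sum(pool.map(count_key_matches, paths))
```

It could be called as count_in_files(glob.glob(path + '/*.csv')). Because the rows are compared as text, a file that stores the value as 1.0 would need a different wanted argument.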
I ran into a pickle (literally) in parallelizing the following Python code and could really use some help.
First of all, the input is a CSV file consisting of a list of website links that I need to scrape with the function scrape_code(). The original code is as follows and runs perfectly:
import csv
import re

with open('C:\\links.csv','r') as source:
    reader=csv.reader(source)
    inputlist=list(reader)
m=[]
for i in inputlist:
    m.append(scrape_code(re.sub("\'|\[|\]",'',str(i)))) #remove the quotes around the link strings otherwise it results in URLError
print(m)
I then tried to parallelize this code using joblib as follows:
from joblib import Parallel, delayed
import multiprocessing
with open('C:\\links.csv','r') as source:
    reader=csv.reader(source)
    inputlist=list(reader)
cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=cores)(delayed(m.append(scrape_code(re.sub("\'|\[|\]",'',str(i))))) for i in inputlist)
However, this would result in a weird error:
File "C:\Users\...\joblib\pool.py", line 371, in send
CustomizablePickler(buffer, self._reducers).dump(obj)
AttributeError: Can't pickle local object 'delayed.<locals>.delayed_function'
Any idea what I did wrong here? If I try to put the append in a separate function like below then the error would go away, but the execution would then freeze and hang indefinitely:
def process(k):
    a=[]
    a.append(scrape_code(re.sub("\'|\[|\]",'',str(k))))
    return a
cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=cores)(delayed(process)(i) for i in inputlist)
The input list has tens of thousands of pages, so parallel processing would be a huge benefit.
If you really need it in separate processes, the easiest way is to just create a process pool and let it deal with distributing the links to your function, e.g.:
import csv
from multiprocessing import Pool

if __name__ == "__main__": # multiprocessing guard
    with open("c:\\links.csv", "r", newline="") as f: # open the CSV
        reader = csv.reader(f) # create a reader
        links = [r[0] for r in reader] # collect only the first column
    with Pool() as pool: # create a pool, it will make a pool with all your CPU cores...
        results = pool.map(scrape_code, links) # distribute your links to scrape_code
    print(results)
NOTE: I'm assuming your links.csv actually holds the link in its first column based on how you're pre-processing the links in your code.
However, as I've stated in my comment, this isn't necessarily faster than plain threading, so I'd first try it using threads. Fortunately, the multiprocessing module includes a threading interface, multiprocessing.dummy, so you just need to replace from multiprocessing import Pool with from multiprocessing.dummy import Pool and see in which regime your code works faster.
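To make the comparison concrete, here is a minimal sketch of the threaded variant. scrape_code is replaced by a hypothetical stand-in, fake_scrape, that only sleeps briefly to mimic waiting on the network, and the links are made up, so the timing it prints says nothing about real pages.

```python
import time
from multiprocessing.dummy import Pool  # thread-backed Pool with the same API


def fake_scrape(link):
    # Hypothetical stand-in for scrape_code(): pretend to wait on the network.
    time.sleep(0.05)
    return len(link)


def scrape_all(links, workers=8):
    # map() returns results in the same order as the input links.
    with Pool(workers) as pool:
        return pool.map(fake_scrape, links)


if __name__ == "__main__":
    links = ["http://example.com/page%d" % i for i in range(16)]
    start = time.perf_counter()
    results = scrape_all(links)
    print(len(results), "pages in %.2f s" % (time.perf_counter() - start))
```

Since scraping mostly waits on I/O, threads often do well here, and the number of workers can reasonably exceed the number of CPU cores. Measuring both variants on a small sample of the real links is the safest way to decide.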
I am trying to figure out how to write a program that performs computations in parallel such that the result of each computation can be written to a file in a specific order. My problem is size; I would like to do what I've outlined in the sample program below - save the large output as the value of a dictionary which stores the ordering system in its keys. But my program keeps breaking because it can't store/pass around so many bytes.
Is there a set way to approach such problems? I'm new to dealing with both multiprocessing and large data.
from multiprocessing import Process, Manager

def eachProcess(i, d):
    LARGE_BINARY_OBJECT = #perform some computation resulting in millions of bytes
    d[i] = LARGE_BINARY_OBJECT

def main():
    manager = Manager()
    d = manager.dict()
    maxProcesses = 10
    for i in range(maxProcesses):
        process = Process(target=eachProcess, args=(i,d))
        process.start()

    counter = 0
    while counter < maxProcesses:
        file1 = open("test.txt", "wb")
        if counter in d:
            file1.write(d[counter])
            counter += 1

if __name__ == '__main__':
    main()
Thank you.
When dealing with large data, there are usually two approaches:
Local file system if the problem is simple enough
Remote data storage if more complex support over data is needed
As your problem seems pretty simple, I'd suggest the following solution. Each process writes its partial solution to a local file. Once all processing is done, the main process combines all result files together.
import os
import tempfile
from multiprocessing import Pool

max_processes = 10  # same value as maxProcesses in the question

def worker_function(partial_result_path):
    data = produce_large_binary()  # placeholder for the real computation
    with open(partial_result_path, 'wb') as partial_result_file:
        partial_result_file.write(data)

if __name__ == '__main__':
    # storing partial results in temporary files; pass paths, not file objects
    temp_dir = tempfile.mkdtemp()
    partial_result_paths = [os.path.join(temp_dir, 'part_%d' % i)
                            for i in range(max_processes)]

    with Pool(max_processes) as pool:
        pool.map(worker_function, partial_result_paths)

    # combine the partial files in order
    with open('test.txt', 'wb') as result_file:
        for partial_result_path in partial_result_paths:
            with open(partial_result_path, 'rb') as partial_result_file:
                result_file.write(partial_result_file.read())
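One caveat with the combining step: partial_result_file.read() loads a whole partial file into memory, which may hurt when each part is very large. A possible refinement, sketched below, is to copy each part in fixed-size blocks with shutil.copyfileobj. The function name combine_files and the 1 MiB block size are illustrative choices, not part of the original answer.

```python
import shutil


def combine_files(part_paths, output_path, block_size=1024 * 1024):
    # Append every part to the output in the order given, one block at a time,
    # so memory use is bounded by block_size rather than by the part size.
    with open(output_path, "wb") as output_file:
        for part_path in part_paths:
            with open(part_path, "rb") as part_file:
                shutil.copyfileobj(part_file, output_file, block_size)
```

It would replace the final loop, e.g. combine_files(partial_result_paths, 'test.txt'). The output order follows the order of the path list, which is what preserves the ordering the question asks for.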
I am using Python multiprocessing, more precisely
from multiprocessing import Pool
p = Pool(15)
args = [(df, config1), (df, config2), ...] #list of args - df is the same object in each tuple
res = p.map_async(func, args) #func is some arbitrary function
p.close()
p.join()
This approach has huge memory consumption, eating up pretty much all my RAM (at which point it gets extremely slow, making the multiprocessing pretty useless). I assume the problem is that df is a huge object (a large pandas dataframe) and it gets copied for each process. I have tried using multiprocessing.Value to share the dataframe without copying:
shared_df = multiprocessing.Value(pandas.DataFrame, df)
args = [(shared_df, config1), (shared_df, config2), ...]
(as suggested in Python multiprocessing shared memory), but that gives me TypeError: this type has no size (same as Sharing a complex object between Python processes?, to which I unfortunately don't understand the answer).
I am using multiprocessing for the first time and maybe my understanding is not (yet) good enough. Is multiprocessing.Value actually even the right thing to use in this case? I have seen other suggestions (e.g. queue) but am by now a bit confused. What options are there to share memory, and which one would be best in this case?
The first argument to Value is typecode_or_type. That is defined as:
typecode_or_type determines the type of the returned object: it is
either a ctypes type or a one character typecode of the kind used by
the array module. *args is passed on to the constructor for the type.
The key part is that it is "either a ctypes type or a one character typecode". So you simply cannot put a pandas dataframe in a Value; it has to be a ctypes type.
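To illustrate the rule quoted above, here is a small sketch of what Value accepts and what it rejects. The helper try_value is hypothetical, and a plain dict stands in for the DataFrame so that the example needs nothing beyond the standard library; the assumption is that any class that is not a ctypes type is rejected the same way.

```python
import ctypes
from multiprocessing import Value


def try_value(type_or_code, *args):
    # Return the stored value, or the error text if the type is rejected.
    try:
        return Value(type_or_code, *args).value
    except TypeError as exc:
        return "TypeError: %s" % exc


print(try_value("d", 3.5))         # one-character typecode for a C double
print(try_value(ctypes.c_int, 7))  # a ctypes type
print(try_value(dict, {}))         # an ordinary Python class is rejected
```

The first two calls succeed because a C double and a C int have a fixed size that can be laid out in shared memory. An arbitrary Python object has no such size, which is what the error in the question is reporting.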
You could instead use a multiprocessing.Manager to serve your singleton dataframe instance to all of your processes. There are a few different ways to end up in the same place; probably the easiest is to just plop your dataframe into the manager's Namespace.
from multiprocessing import Manager
mgr = Manager()
ns = mgr.Namespace()
ns.df = my_dataframe
# now just give your processes access to ns, i.e. most simply
# p = Process(target=worker, args=(ns, work_unit))
Now your dataframe instance is accessible to any process that gets passed a reference to the Manager. Or just pass a reference to the Namespace; it's cleaner.
One thing I didn't/won't cover is events and signaling. If your processes need to wait for others to finish executing, you'll need to add that in. Here is a page with some Event examples, which also covers in a bit more detail how to use the manager's Namespace.
(note that none of this addresses whether multiprocessing is going to result in tangible performance benefits, this is just giving you the tools to explore that question)
You can use Array instead of Value for storing your dataframe.
The solution below converts a pandas dataframe to an object that stores its data in shared memory:
import numpy as np
import pandas as pd
import multiprocessing as mp
import ctypes

# the original dataframe is df, store the columns/dtypes pairs
df_dtypes_dict = dict(zip(df.columns, df.dtypes))

# declare a shared Array with data from df
mparr = mp.Array(ctypes.c_double, df.values.reshape(-1))

# create a new df based on the shared array
df_shared = pd.DataFrame(np.frombuffer(mparr.get_obj()).reshape(df.shape),
                         columns=df.columns).astype(df_dtypes_dict)
If you now share df_shared across processes, no additional copies will be made. For your case:
def fun(config):
    # df_shared is global to the script
    df_shared.apply(config) # whatever compute you do with df/config

config_list = [config1, config2]
pool = mp.Pool(15) # create the pool after fun is defined so the workers can find it
res = pool.map_async(fun, config_list)
pool.close()
pool.join()
This is also particularly useful if you use pandarallel, for example:
# this will not explode in memory
from pandarallel import pandarallel
pandarallel.initialize()
df_shared.parallel_apply(your_fun, axis=1)
Note: with this solution you end up with two dataframes (df and df_shared), which consume twice the memory and take a long time to initialise. It might be possible to read the data directly into shared memory.
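As a rough illustration of writing data straight into shared memory, the sketch below uses multiprocessing.shared_memory (Python 3.8+) with only the standard library. It is not the answer's method: the helper names write_doubles and read_doubles are hypothetical, and it stores plain doubles rather than a DataFrame. Wrapping such a block in a NumPy array (for example np.ndarray(shape, dtype=np.float64, buffer=shm.buf)) and then in a DataFrame is left out here.

```python
from multiprocessing import shared_memory


def write_doubles(values):
    # Allocate a shared block sized for the values (8 bytes per double)
    # and write them straight into it, without building a second copy first.
    shm = shared_memory.SharedMemory(create=True, size=8 * len(values))
    view = shm.buf[:8 * len(values)].cast("d")
    for i, value in enumerate(values):
        view[i] = value
    view.release()
    return shm


def read_doubles(name, count):
    # Attach to an existing block by name, as a worker process would.
    shm = shared_memory.SharedMemory(name=name)
    view = shm.buf[:8 * count].cast("d")
    values = list(view)
    view.release()
    shm.close()
    return values


if __name__ == "__main__":
    shm = write_doubles([1.5, 2.5, 3.5])
    try:
        print(read_doubles(shm.name, 3))
    finally:
        shm.close()
        shm.unlink()  # free the block once no process needs it
```

Only the block's name and the element count need to be passed to the workers, which keeps the arguments small. The creating process stays responsible for calling unlink when the data is no longer needed.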
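At least Python 3.6 supports storing a pandas DataFrame in a multiprocessing.Value by using ctypes.py_object. See the working example below. Note that it runs in a single process, and py_object holds only a pointer to the object, so the DataFrame's data itself is not placed in shared memory.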
import ctypes
import pandas as pd
from multiprocessing import Value

df = pd.DataFrame({'a': range(0,9),
                   'b': range(10,19),
                   'c': range(100,109)})

k = Value(ctypes.py_object)
k.value = df

print(k.value)
You can share a pandas dataframe between processes without any memory overhead by creating a data_handler child process. This process receives calls from the other children with specific data requests (e.g. a row, a specific cell, a slice) from your very large dataframe object. Only the data_handler process keeps your dataframe in memory, unlike a Manager object such as Namespace, which causes the dataframe to be copied to all child processes. See below for a working example. This can be converted to use a pool.
Need a progress bar for this? See my answer here: https://stackoverflow.com/a/55305714/11186769
import time
import queue  # Python 3 name of the Python 2 Queue module
import numpy as np
import pandas as pd
import multiprocessing
from random import randint

#==========================================================
# DATA HANDLER
#==========================================================
def data_handler( queue_c, queue_r, queue_d, n_processes ):
    # Create a big dataframe
    big_df = pd.DataFrame(np.random.randint(
        0,100,size=(100, 4)), columns=list('ABCD'))
    # Handle data requests
    finished = 0
    while finished < n_processes:
        try:
            # Get the index we sent in
            idx = queue_c.get(False)
        except queue.Empty:
            continue
        else:
            if idx == 'finished':
                finished += 1
            else:
                try:
                    # Use the big_df here!
                    B_data = big_df.loc[ idx, 'B' ]
                    # Send back some data
                    queue_r.put(B_data)
                except Exception:
                    pass
    # big_df may need to be deleted at the end.
    #import gc; del big_df; gc.collect()

#==========================================================
# PROCESS DATA
#==========================================================
def process_data( queue_c, queue_r, queue_d):
    data = []
    # Save computer memory with a generator
    generator = ( randint(0,x) for x in range(100) )
    for g in generator:
        """
        Lets make a request by sending
        in the index of the data we want.
        Keep in mind you may receive another
        child processes return call, which is
        fine if order isnt important.
        """
        #print(g)
        # Send an index value
        queue_c.put(g)
        # Handle the return call
        while True:
            try:
                return_call = queue_r.get(False)
            except queue.Empty:
                continue
            else:
                data.append(return_call)
                break
    queue_c.put('finished')
    queue_d.put(data)

#==========================================================
# START MULTIPROCESSING
#==========================================================
def multiprocess( n_processes ):
    combined = []
    processes = []
    # Create queues
    queue_data = multiprocessing.Queue()
    queue_call = multiprocessing.Queue()
    queue_receive = multiprocessing.Queue()
    for process in range(n_processes):
        if process == 0:
            # Load your data_handler once here
            p = multiprocessing.Process(target = data_handler,
                args=(queue_call, queue_receive, queue_data, n_processes))
            processes.append(p)
            p.start()
        p = multiprocessing.Process(target = process_data,
            args=(queue_call, queue_receive, queue_data))
        processes.append(p)
        p.start()
    for i in range(n_processes):
        data_list = queue_data.get()
        combined += data_list
    for p in processes:
        p.join()
    # Your B values
    print(combined)

if __name__ == "__main__":
    multiprocess( n_processes = 4 )
I was pretty surprised that joblib's Parallel (since 1.0.1 at least) supports sharing pandas dataframes with multiprocess workers out of the box already. At least with the 'loky' backend.
One thing I figured out experimentally: parameters you pass to the function should not contain any large dict. If they do, turn the dict into a Series or DataFrame.
Some additional memory certainly gets used by each worker, but much less than the size of your supposedly 'big' dataframe residing in the main process, and the computation begins right away in all workers. Otherwise, joblib starts all your requested workers, but they sit idle while objects are copied into each one sequentially, which takes a long time. I can provide a code sample if someone needs it. I have tested dataframe processing only in read-only mode. The feature is not mentioned in the docs, but it works for pandas.