How to stream results from Multiprocessing.Pool to csv? - python

I have a python process (2.7) that takes a key, does a bunch of calculations and returns a list of results. Here is a very simplified version.
I am using multiprocessing to create worker processes so this can be processed faster. However, my production data has several million rows, and each loop takes progressively longer to complete. The last time I ran this, each loop took over 6 minutes to complete, while at the start it took a second or less. I think this is because all the workers are adding results into resultset, which keeps growing until it contains all the records.
Is it possible to use multiprocessing to stream the results of each worker (a list) into a CSV, or to batch resultset so it writes to the CSV after a set number of rows?
Any other suggestions for speeding up or optimizing the approach would be appreciated.
import numpy as np
import pandas as pd
import csv
import os
import multiprocessing
from multiprocessing import Pool

global keys
keys = [1,2,3,4,5,6,7,8,9,10,11,12]

def key_loop(key):
    test_df = pd.DataFrame(np.random.randn(1,4), columns=['a','b','c','d'])
    test_list = test_df.ix[0].tolist()
    return test_list

if __name__ == "__main__":
    try:
        pool = Pool(processes=8)
        resultset = pool.imap(key_loop, (key for key in keys))
        loaddata = []
        for sublist in resultset:
            loaddata.append(sublist)
        with open("C:\\Users\\mp_streaming_test.csv", 'w') as file:
            writer = csv.writer(file)
            for listitem in loaddata:
                writer.writerow(listitem)
            file.close
        print "finished load"
    except:
        print 'There was a problem multithreading the key Pool'
        raise

Here is an answer consolidating the suggestions Eevee and I made
import numpy as np
import pandas as pd
import csv
from multiprocessing import Pool

keys = [1,2,3,4,5,6,7,8,9,10,11,12]

def key_loop(key):
    test_df = pd.DataFrame(np.random.randn(1,4), columns=['a','b','c','d'])
    test_list = test_df.ix[0].tolist()
    return test_list

if __name__ == "__main__":
    try:
        pool = Pool(processes=8)
        resultset = pool.imap(key_loop, keys, chunksize=200)
        with open("C:\\Users\\mp_streaming_test.csv", 'w') as file:
            writer = csv.writer(file)
            for listitem in resultset:
                writer.writerow(listitem)
        print "finished load"
    except:
        print 'There was a problem multithreading the key Pool'
        raise
Again, the changes here are:
1) Iterate over resultset directly, rather than needlessly copying it to a list first.
2) Feed the keys list directly to pool.imap instead of creating a generator comprehension out of it.
3) Provide a larger chunksize to imap than the default of 1. The larger chunksize reduces the cost of the inter-process communication required to pass the values inside keys to the sub-processes in your pool, which can give big performance boosts when keys is very large (as it is in your case). You should experiment with different values for chunksize (try something considerably larger than 200, like 5000, etc.) and see how it affects performance. I'm making a wild guess with 200, though it should definitely do better than 1.
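If you want to measure the effect yourself, a rough timing sketch along these lines can help pick a chunksize; the toy worker and key count below are made up for illustration and just mimic the shape of the real job:

import time
from multiprocessing import Pool

def key_loop(key):
    # stand-in for the real per-key work
    return [key, key * 2, key * 3, key * 4]

if __name__ == "__main__":
    keys = range(100000)  # hypothetical large key list
    pool = Pool(processes=8)
    for chunksize in (1, 200, 5000):
        start = time.time()
        for _ in pool.imap(key_loop, keys, chunksize=chunksize):
            pass
        print "chunksize=%d took %.2fs" % (chunksize, time.time() - start)
    pool.close()
    pool.join()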

The following very simple code collects many workers' data into a single CSV file. A worker takes a key and returns a list of rows. The parent processes several keys at a time, using several workers. As each key finishes, the parent writes the output rows, in order, to a CSV file.
Be careful about order. If each worker writes to the CSV file directly, the rows will be out of order or the workers will stomp on each other. Having each worker write to its own CSV file would be fast, but would require merging all the data files together afterward.
source
import csv, multiprocessing, sys

def worker(key):
    return [ [key, 0], [key+1, 1] ]

pool = multiprocessing.Pool()  # default 1 proc per CPU
writer = csv.writer(sys.stdout)
for resultset in pool.imap(worker, [1,2,3,4]):
    for row in resultset:
        writer.writerow(row)
output
1,0
2,1
2,0
3,1
3,0
4,1
4,0
5,1
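If you want the rows in a file rather than on stdout, the same pattern works with a csv writer over an open file; the filename below is just a placeholder:

import csv, multiprocessing

def worker(key):
    return [[key, 0], [key + 1, 1]]

if __name__ == "__main__":
    pool = multiprocessing.Pool()
    with open("results.csv", "wb") as f:   # "wb" for the csv module on Python 2
        writer = csv.writer(f)
        for resultset in pool.imap(worker, [1, 2, 3, 4]):
            writer.writerows(resultset)
    pool.close()
    pool.join()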

My bet is that appending everything to one large structure is what makes it slow. What I usually do is open as many files as there are cores and use modulo to write to each file immediately, so the streams don't cause trouble the way they would if you directed them all into the same file (write errors), and you also avoid holding huge data in memory. It's probably not the best solution, but it is really quite easy. In the end you just merge the results back together (see the sketch after the notes below).
Define at start of the run:
num_cores = 8
file_sep = ","
outFiles = [open('out' + str(x) + ".csv", "a") for x in range(num_cores)]
Then in the key_loop function:
def key_loop(key):
    test_df = pd.DataFrame(np.random.randn(1,4), columns=['a','b','c','d'])
    test_list = test_df.ix[0].tolist()
    outFiles[key % num_cores].write(file_sep.join([str(x) for x in test_list]) + "\n")
Afterwards, don't forget to close: [x.close() for x in outFiles]
Improvements:
Iterate over blocks, as mentioned in the comments. Writing/processing one line at a time is going to be much slower than writing blocks.
Handle errors (closing of files).
IMPORTANT: I'm not sure what the "keys" variable holds, but the numbers there will not allow modulo to ensure each process writes to its own individual stream (12 keys modulo 8 will make two keys write to the same file).
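A minimal sketch of that final merge step, assuming the out<N>.csv naming used above:

# merge the per-process CSV files back into one file
num_cores = 8
with open("merged.csv", "w") as merged:
    for x in range(num_cores):
        with open("out" + str(x) + ".csv") as part:
            for line in part:
                merged.write(line)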

Related

Multiprocessing a loop

I have a script that loops over a pandas dataframe and outputs GIS data to a geopackage based on some searches and geometry manipulation. It works when I use a for loop, but with over 4k records it takes a while. Since I have it built as its own function that returns what I need based on a row iteration, I tried to run it with multiprocessing with:
import pandas as pd, bwe_mapping
from multiprocessing import Pool

# Sample dataframe
bwes = [['id', 7216], ['item_id', 3277841], ['Date', '2019-01-04T00:00:00.000Z'], ['start_lat', -56.92], ['start_lon', 45.87], ['End_lat', -59.87], ['End_lon', 44.67]]
bwedf = pd.read_csv(bwes)
geopackage = "datalocation\geopackage.gpkg"
tracklayer = "tracks"

if __name__=='__main__':
    def task(item):
        bwe_mapping.map_bwe(item, geopackage, tracklayer)
    pool = Pool()
    for index, row in bwedf.iterrows():
        task(row)
    with Pool() as pool:
        for results in pool.imap_unordered(task, bwedf.iterrows()):
            print(results)
When I run this, my Task Manager populates with 16 new python tasks but there is no sign that anything is being done. Would it be better to use numpy.array_split() to break up my pandas df into 4 or 8 smaller ones and run the for index, row in bwedf.iterrows(): loop for each dataframe on its own processor?
The processes don't need to finish in any particular order, as long as I can store the outputs, which are geopandas dataframes, in a list to concatenate into geopackage layers at the end.
Should I have put the for loop in the function and just passed it the whole dataframe and GIS data to search?
If you are running on Windows/macOS, then multiprocessing is going to use spawn to create the workers, which means that any child MUST be able to find the function it is going to execute when it imports your main script.
Your code has the function definition inside your if __name__=='__main__': block, so the children don't have access to it.
Simply moving the function definition to before if __name__=='__main__': will make it work.
What is happening is that each child crashes when it tries to run the function because it never saw its definition.
Minimal code to reproduce the problem:
from multiprocessing import Pool

if __name__ == '__main__':
    def task(item):
        print(item)
        return item
    pool = Pool()
    with Pool() as pool:
        for results in pool.imap_unordered(task, range(10)):
            print(results)
The solution is to move the function definition to before the if __name__=='__main__': line, as shown below.
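For completeness, a fixed version of the minimal example above, with the definition moved up, looks like this:

from multiprocessing import Pool

def task(item):
    print(item)
    return item

if __name__ == '__main__':
    with Pool() as pool:
        for results in pool.imap_unordered(task, range(10)):
            print(results)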
Edit: Now, to iterate over rows in a dataframe, this simple example demonstrates how to do it. Note that iterrows returns an index and a row, which is why the item is unpacked inside the task.
import os
import pandas as pd
from multiprocessing import Pool
import time

# Sample dataframe
bwes = [['id', 7216], ['item_id', 3277841], ['Date', '2019-01-04T00:00:00.000Z'], ['start_lat', -56.92],
        ['start_lon', 45.87], ['End_lat', -59.87], ['End_lon', 44.67]]
bwef = pd.DataFrame(bwes)

def task(item):
    time.sleep(1)
    index, row = item
    # print(os.getpid(), tuple(row))
    return str(os.getpid()) + " " + str(tuple(row))

if __name__ == '__main__':
    with Pool() as pool:
        for results in pool.imap_unordered(task, bwef.iterrows()):
            print(results)
The time.sleep(1) is only there because there is so little work that one worker might grab it all, so I am forcing every worker to wait for the others; you should remove it. The result is as follows:
13228 ('id', 7216)
11376 ('item_id', 3277841)
15580 ('Date', '2019-01-04T00:00:00.000Z')
10712 ('start_lat', -56.92)
11376 ('End_lat', -59.87)
13228 ('start_lon', 45.87)
10712 ('End_lon', 44.67)
It seems like your "example" dataframe is transposed, but you just have to construct the dataframe correctly. I'd recommend you first run the code serially with iterrows before running it across multiple cores.
Obviously, sending data to the workers and back from them takes time, so make sure each worker is doing a lot of computational work and not just sending it straight back to the parent process; one way to do that is sketched below.
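As the question itself suggests, you could hand each worker a chunk of rows at a time, for example with np.array_split. This is only a rough sketch: the dataframe is a stand-in and the per-row work is a placeholder.

import numpy as np
import pandas as pd
from multiprocessing import Pool

def task_chunk(chunk):
    # placeholder: do the real per-row work here and collect the outputs
    out = []
    for index, row in chunk.iterrows():
        out.append(tuple(row))
    return out

if __name__ == '__main__':
    bwedf = pd.DataFrame({'a': range(100), 'b': range(100)})  # stand-in dataframe
    chunks = np.array_split(bwedf, 8)  # roughly one chunk per worker
    with Pool() as pool:
        for results in pool.imap_unordered(task_chunk, chunks):
            print(len(results), "rows processed in this chunk")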

Trying to apply multi-processing to a function that processes xml to csv

I am trying to process ~477,000 XML files to csv using a library called irsx. The process is taking ages, so I am attempting to find ways to speed it up. Does anyone know how I can effectively apply multi-processing to this function and utilize all my computer's cores?
I have tried creating a pool and using .apply_async() but it didn't work as expected.
import os
from irsx.xmlrunner import XMLRunner
import pandas as pd
import time
import flatdict
from collections import defaultdict
import multiprocessing as mp
import glob

Frames1 = pd.DataFrame()
directory = "/Users/upmetrics/Desktop/990ALLXML/"

def listdir_nohidden(path):
    for f in os.listdir(path):
        if not f.startswith('.'):
            yield f

myfiles = list(listdir_nohidden(directory))
listfiles = len([str(file) for file in listdir_nohidden(directory)])
dataframes = {}
mydict = {}

def Process():
    current_file = 1
    for file in listdir_nohidden(directory):
        # Get just the id of the 990 record from the file name
        record_id = file.split('_')[0]
        parsed_filing = XMLRunner().run_filing(record_id)
        progress = (current_file / listfiles) * 100
        if current_file % 100 == 0:
            print(("{}% Complete!").format(round(progress, 2)))
            print(("{} out of {} processed.").format(current_file, listfiles))
        for sked in parsed_filing.get_result():
            fields = flatdict.FlatterDict(sked['schedule_parts'], delimiter=":")
            dictionary_of_fields = defaultdict(list)
            for key, value in fields.items():
                dictionary_of_fields[key].append(value)
            if sked['schedule_name'] in dataframes.keys():
                # Add new data to an existing section
                current_frame = dataframes[sked['schedule_name']]
                new_frame = pd.DataFrame().from_dict(dictionary_of_fields)
                updated_frame = pd.concat([current_frame, new_frame], join='outer', sort=True, ignore_index=True)
                dataframes[sked['schedule_name']] = updated_frame
            else:
                # This section hasn't been seen yet - create it
                dataframes[sked['schedule_name']] = pd.DataFrame().from_dict(dictionary_of_fields)
        current_file += 1
    return dataframes
For such a job I would use Pool.imap_unordered, basically because it starts yielding results as soon as they are available.
import multiprocessing as mp
import os
import base64

def worker(path):
    # Generate random output file name
    tmpname = base64.b64encode(os.urandom(12), b'__').decode() + '.csv'
    with open(path) as f:
        data = f.read()
    # << process your data here, put csv formatted data in csvdata.... >>
    with open(tmpname, 'w') as f:
        f.write(csvdata)
    return {'source': path, 'result': tmpname}

# Generate your list of filenames "myfiles" here.
p = mp.Pool()
for rv in p.imap_unordered(worker, myfiles):
    print('Processed: ', rv['source'])
    # Append the data from rv['result'] to a master csv file...
Edit: So why have each worker write to a file? You have a lot of data files. My assumption is that these each contain a significant amount of data.
You could have the worker return that data. But then the worker process would have to transfer that data back to the parent process. This is done using a SimpleQueue, if I read the source code correctly. It involves pickling the data in the worker process and unpickling in the parent process. Deep down it uses named pipes on ms-windows, and sockets on other systems.
These transfer mechanisms use small buffer sizes of 8192 bytes.
If, on the other hand, you use mmap to write and concatenate the data, most of it would still be in the OS's buffer cache from the worker's write when you read/write it in the parent.
Obviously you'd have to run tests to determine which one is faster. A lot depends on the size of the data from the individual files.
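For comparison, a sketch of the return-the-data variant discussed above; the parsing step is a placeholder and the file list is hypothetical:

import csv
import multiprocessing as mp

def worker(path):
    with open(path) as f:
        data = f.read()
    # << placeholder: parse `data` and build a list of csv rows >>
    rows = [[path, len(data)]]
    return rows  # this list is pickled and sent back to the parent process

if __name__ == '__main__':
    myfiles = ['file1.xml', 'file2.xml']  # hypothetical list of input files
    p = mp.Pool()
    with open('master.csv', 'w') as f:
        writer = csv.writer(f)
        for rows in p.imap_unordered(worker, myfiles):
            writer.writerows(rows)
    p.close()
    p.join()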

How can I make python scripts run faster or use multiprocess in this case?

I'm trying to measure four similarities (cosine_similarity, jaccard, SequenceMatcher similarity, jaccard_variants similarity) over 800K pairs of documents.
Every document file is in txt format and about 100 KB ~ 300 KB (about 1,500,000 characters).
I have two questions regarding how to make my python scripts faster:
MY PYTHON SCRIPTS:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from difflib import SequenceMatcher

def get_tf_vectors(doc1, doc2):
    text = [doc1, doc2]
    vectorizer = CountVectorizer(text)
    vectorizer.fit(text)
    return vectorizer.transform(text).toarray()

def measure_sim(doc1, doc2):
    a, b = doc1.split(), doc2.split()
    c, d = set(a), set(b)
    vectors = [t for t in get_tf_vectors(doc1, doc2)]
    return cosine_similarity(vectors)[1][0], float(len(c&d) / len(c|d)), \
           1 - (sum(abs(vectors[0] - vectors[1])) / sum(vectors[0] + vectors[1])), \
           SequenceMatcher(None, a, b).ratio()

# items in doc_pair list are like ('ID', 'doc1_directory', 'doc2_directory')
def data_analysis(doc_pair_list):
    result = {}
    for item in doc_pair_list:
        f1 = open(item[1], 'rb')
        doc1 = f1.read()
        f1.close()
        f2 = open(item[2], 'rb')
        doc2 = f2.read()
        f2.close()
        result[item[0]] = measure_sim(doc1, doc2)
However, this code uses only 10% of my CPU, and it takes almost 20 days for this task to be done. So I want to ask if there is any way to make this code more efficient.
Q1. Since the documents are saved on an HDD, I thought loading the text data would take some time. Hence, I suspect that loading only two documents every time the computer computes the similarities might not be efficient, so I am going to try loading 50 pairs of documents at once and computing their similarities respectively. Would that be helpful?
Q2. Most of the postings about "How to make your code run faster" say that I should use Python modules based on C code. However, since I'm using the sklearn module, which is known to be quite efficient, I wonder whether there is any better way.
Is there any way that could help this python script to use more computer resources and become faster??
There may be better solutions, but you can try something like this if computing the similarities is the bottleneck:
1) A separate process to read all the files one by one and put them into a multiprocessing.Queue
2) A pool of multiple worker processes to compute the similarities and put the results into a multiprocessing.Queue.
3) The main thread then simply loads results from results_queue and saves them to a dictionary as you have it now.
I don't know your hardware limitations (number and speed of CPU cores, RAM size, disk read speed) and I don't have any samples to test it on.
EDIT: The described code is provided below. Please try it, check whether it is faster, and let me know. If the main blocker is loading of files, we can create more loader processes (e.g. 2 processes, each loading half of the files). If the blocker is calculating similarities, then you can create more worker processes (just change worker_count). Finally, 'results' is the dictionary with all the results.
import multiprocessing
import os
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_tf_vectors(doc1, doc2):
    text = [doc1, doc2]
    vectorizer = CountVectorizer(text)
    vectorizer.fit(text)
    return vectorizer.transform(text).toarray()

def calculate_similarities(doc_pairs_queue, results_queue):
    """ Pick docs from doc_pairs_queue and calculate their similarities, save the result to results_queue. Repeat infinitely (until process is terminated). """
    while True:
        pair = doc_pairs_queue.get()
        pair_id = pair[0]
        doc1 = pair[1]
        doc2 = pair[2]
        a, b = doc1.split(), doc2.split()
        c, d = set(a), set(b)
        vectors = [t for t in get_tf_vectors(doc1, doc2)]
        results_queue.put((pair_id, cosine_similarity(vectors)[1][0], float(len(c&d) / len(c|d)),
                           1 - (sum(abs(vectors[0] - vectors[1])) / sum(vectors[0] + vectors[1])),
                           SequenceMatcher(None, a, b).ratio()))

def load_files(doc_pair_list, loaded_queue):
    """
    Pre-load files and put them to a queue, so working processes can get them.
    :param doc_pair_list: list of files to be loaded (ID, doc1_path, doc2_path)
    :param loaded_queue: multiprocessing.Queue that will hold pre-loaded data
    """
    print("Started loading files...")
    for item in doc_pair_list:
        with open(item[1], 'rb') as f1:
            with open(item[2], 'rb') as f2:
                loaded_queue.put((item[0], f1.read(), f2.read()))  # if queue is full, this automatically waits until there is space
    print("Finished loading files.")

def data_analysis(doc_pair_list):
    # create a loader process that will pre-load files (it does no calculations, so it loads much faster)
    # loader puts loaded files to a queue; 1 pair ~ 500 KB, 1000 pairs ~ 500 MB max size of queue (RAM memory)
    loaded_pairs_queue = multiprocessing.Queue(maxsize=1000)
    loader = multiprocessing.Process(target=load_files, args=(doc_pair_list, loaded_pairs_queue))
    loader.start()

    # create worker processes - these will do all calculations
    results_queue = multiprocessing.Queue(maxsize=1000)  # workers put results to this queue
    worker_count = os.cpu_count() if os.cpu_count() else 2  # number of worker processes
    workers = []  # create list of workers, so we can terminate them later
    for i in range(worker_count):
        worker = multiprocessing.Process(target=calculate_similarities, args=(loaded_pairs_queue, results_queue))
        worker.start()
        workers.append(worker)

    # main process just picks the results from queue and saves them to the dictionary
    results = {}
    i = 0  # results counter
    pairs_count = len(doc_pair_list)
    while i < pairs_count:
        res = results_queue.get(timeout=600)  # timeout is just in case something unexpected happened (results are calculated much quicker)
        # Queue.get() is blocking - if queue is empty, get() waits until something is put into queue and then gets it
        results[res[0]] = res[1:]  # save to dictionary by ID (first item in the result)
        i += 1  # count the result so the loop ends once every pair has been processed

    # clean up the processes (so there aren't any zombies left)
    loader.terminate()
    loader.join()
    for worker in workers:
        worker.terminate()
        worker.join()

    return results  # the dictionary with all the results, keyed by pair ID
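A hypothetical call site, just to show the expected shape of doc_pair_list (the IDs and paths below are placeholders):

if __name__ == '__main__':
    doc_pair_list = [
        ('pair-001', '/data/docs/a1.txt', '/data/docs/a2.txt'),
        ('pair-002', '/data/docs/b1.txt', '/data/docs/b2.txt'),
    ]
    results = data_analysis(doc_pair_list)
    print(len(results), "pairs scored")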
Let me know about the results please, I am quite interested in it and will assist you further if needed ;)
The first thing to do is to find the real bottleneck; using cProfile might confirm your suspicion or shed some more light on your problem.
You should be able to run your code unmodified using cProfile like this:
python -m cProfile -o profiling-results python-file-to-test.py
After that you can analyze the results using pstats like this:
import pstats
stats = pstats.Stats("profiling-results")
stats.sort_stats("tottime")
stats.print_stats(10)
More on profiling your code can be found in Marco Bonazanin's blog article "My Python Code is Slow? Tips for Profiling".

Writing to file in Pool multiprocessing (Python 2.7)

I'm doing a lot of calculations and writing the results to a file. Using multiprocessing, I'm trying to parallelise the calculations.
The problem is that I'm writing to one output file, which all the workers are writing to. I'm quite new to multiprocessing and am wondering how I could make it work.
A very simple concept of the code is given below:
from multiprocessing import Pool

fout_ = open('test' + '.txt', 'w')

def f(x):
    fout_.write(str(x) + "\n")

if __name__ == '__main__':
    p = Pool(5)
    p.map(f, [1, 2, 3])
The result I want would be a file with:
1
2
3
However now I get an empty file. Any suggestions?
I greatly appreciate any help :)!
You shouldn't be letting all the workers/processes write to a single file. They can all read from one file (which may cause slowdowns due to workers waiting for one of them to finish reading), but writing to the same file will cause conflicts and potentially corruption.
As said in the comments, write to separate files instead and then combine them into one in a single process. This small program illustrates it, based on the program in your post:
from multiprocessing import Pool

def f(args):
    ''' Perform computation and write
        to separate file for each '''
    x = args[0]
    fname = args[1]
    with open(fname, 'w') as fout:
        fout.write(str(x) + "\n")

def fcombine(orig, dest):
    ''' Combine files with names in
        orig into one file named dest '''
    with open(dest, 'w') as fout:
        for o in orig:
            with open(o, 'r') as fin:
                for line in fin:
                    fout.write(line)

if __name__ == '__main__':
    # Each sublist is a combination
    # of arguments - number and temporary output
    # file name
    x = range(1, 4)
    names = ['temp_' + str(y) + '.txt' for y in x]
    args = list(zip(x, names))

    p = Pool(3)
    p.map(f, args)
    p.close()
    p.join()

    fcombine(names, 'final.txt')
It runs f for each argument combination, which in this case is a value of x and a temporary file name. It uses a nested list of argument combinations since pool.map does not accept more than one argument. There are other ways to go around this, especially on newer Python versions; one is sketched below.
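For example, on Python 3 you could use Pool.starmap, which unpacks each argument tuple for you; a minimal sketch of the same toy task:

from multiprocessing import Pool

def f(x, fname):
    ''' Same toy computation, but now taking two plain arguments '''
    with open(fname, 'w') as fout:
        fout.write(str(x) + "\n")

if __name__ == '__main__':
    x = range(1, 4)
    names = ['temp_' + str(y) + '.txt' for y in x]
    with Pool(3) as p:
        p.starmap(f, zip(x, names))  # each (x, fname) tuple is unpacked into f's arguments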
For each argument combination and pool member it creates a separate file to which it writes the output. In practice your output will be longer; you can simply add another function that computes it and call it from f. Also, there is no need to use Pool(5) for 3 arguments (though I assume that only three workers were active anyway).
Reasons for calling close() and join() are explained well in this post. It turns out (in the comment to the linked post) that map is blocking, so here you don't need them for the original reasons (wait till they all finish and then write to the combined output file from just one process). I would still use them in case other parallel features are added later.
In the last step, fcombine gathers all the temporary files and copies them into one. It's a bit too nested; if you decide, for instance, to remove each temporary file after copying, you may want to pull the body under with open(dest, ...) or the inner for loop out into a separate function, for readability and functionality.
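If you do go for deleting the temporary files after copying, a small variation on fcombine could look like this (a sketch only):

import os

def fcombine(orig, dest):
    ''' Combine files named in orig into one file named dest,
        deleting each temporary file once it has been copied '''
    with open(dest, 'w') as fout:
        for o in orig:
            with open(o, 'r') as fin:
                for line in fin:
                    fout.write(line)
            os.remove(o)  # temporary file no longer needed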
Multiprocessing.Pool spawns processes; writing to a common file from each process without a lock can cause data loss.
As you said, you are trying to parallelise the calculation, and multiprocessing.Pool can be used to parallelize the computation.
Below is a solution that does the computation in parallel and writes the results to a file from the parent process; hope it helps:
from multiprocessing import Pool
# library for time
import datetime

# file in which you want to write
fout = open('test.txt', 'wb')

# function for your calculations; I have tried to make it time consuming
def calc(x):
    x = x**2
    sum = 0
    for i in range(0, 1000000):
        sum += i
    return x

# function to write to the txt file; it takes a list of items to write
def f(res):
    global fout
    for x in res:
        fout.write(str(x) + "\n")

if __name__ == '__main__':
    qs = datetime.datetime.now()
    arr = [1, 2, 3, 4, 5, 6, 7]
    p = Pool(5)
    res = p.map(calc, arr)
    # write the calculated list to the file
    f(res)
    qe = datetime.datetime.now()
    print (qe-qs).total_seconds()*1000

    # to compare the improvement using multiprocessing, iterative solution
    qs = datetime.datetime.now()
    for item in arr:
        x = calc(item)
        fout.write(str(x)+"\n")
    qe = datetime.datetime.now()
    print (qe-qs).total_seconds()*1000
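If you nevertheless want every worker to append to the same file, one common pattern (not taken from the answers above, just a sketch) is to share a multiprocessing.Lock with the workers via the Pool initializer, so that writes cannot interleave:

from multiprocessing import Pool, Lock

lock = None

def init(l):
    # runs once in every worker; stores the shared lock in a module-level global
    global lock
    lock = l

def calc_and_write(x):
    result = x ** 2
    with lock:  # only one process writes at a time
        with open('test.txt', 'a') as fout:
            fout.write(str(result) + "\n")

if __name__ == '__main__':
    l = Lock()
    p = Pool(5, initializer=init, initargs=(l,))
    p.map(calc_and_write, [1, 2, 3])
    p.close()
    p.join()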

Optimize parsing of GB sized files in parallel

I have several compressed files with sizes on the order of 2GB compressed. The beginning of each file has a set of headers which I parse and extract a list of ~4,000,000 pointers (pointers).
For each pair of pointers (pointers[i], pointers[i+1]) for 0 <= i < len(pointers), I
seek to pointers[i]
read pointers[i+1] - pointers[i] bytes
decompress it
do a single-pass operation on that data and update a dictionary with what I find.
The issue is, I can only process roughly 30 pointer pairs a second using a single Python process, which means each file takes more than a day to get through.
Assuming splitting up the pointers list among multiple processes doesn't hurt performance (due to each process looking at the same file, though different non-overlapping parts), how can I use multiprocessing to speed this up?
My single threaded operation looks like this:
def search_clusters(pointers, filepath, automaton, counter):
    def _decompress_lzma(f, pointer, chunk_size=2**14):
        # skipping over this
        ...
        return uncompressed_buffer

    first_pointer, last_pointer = pointers[0], pointers[-1]
    with open(filepath, 'rb') as fh:
        fh.seek(first_pointer)
        f = StringIO(fh.read(last_pointer - first_pointer))

    for pointer1, pointer2 in zip(pointers, pointers[1:]):
        size = pointer2 - pointer1
        f.seek(pointer1 - first_pointer)
        buffer = _decompress_lzma(f, 0)
        # skipping details, ultimately the counter dict is
        # modified passing the uncompressed buffer through
        # an aho corasick automaton
        counter = update_counter_with_buffer(buffer, automaton, counter)

    return counter

# parse file and return pointers list
bzf = ZimFile(infile)
pointers = bzf.cluster_pointers
counter = load_counter_dict()  # returns collections.Counter()
automaton = load_automaton()

search_clusters(pointers, infile, automaton, counter)
I tried changing this to use multiprocessing.Pool:
from itertools import repeat, izip
import logging
import multiprocessing

logger = multiprocessing.log_to_stderr()
logger.setLevel(multiprocessing.SUBDEBUG)

def chunked(pointers, chunksize=1024):
    for i in range(0, len(pointers), chunksize):
        yield list(pointers[i:i+chunksize+1])

def search_wrapper(args):
    return search_clusters(*args)

# parse file and return pointers list
bzf = ZimFile(infile)
pointers = bzf.cluster_pointers
counter = load_counter_dict()  # returns collections.Counter()
automaton = load_automaton()

map_args = izip(chunked(pointers), repeat(infile),
                repeat(automaton.copy()), repeat(counter.copy()))

pool = multiprocessing.Pool(20)
results = pool.map(search_wrapper, map_args)
pool.close()
pool.join()
but after a little while of processing, I get the following message and the script just hangs there with no further output:
[DEBUG/MainProcess] cleaning up worker 0
[DEBUG/MainProcess] added worker
[INFO/PoolWorker-20] child process calling self.run()
However, if I run with a serial version using the built-in map, without multiprocessing, things run just fine:
map(search_wrapper, map_args)
Any advice on how to change my multiprocessing code so it doesn't hang? Is it even a good idea to attempt to use multiple processes to read the same file?
