[Python 3.1]
My program takes a long time to run just because of the pickle.load method on a huge data structure. This makes debugging very annoying and time-consuming: every time I make a small change, I need to wait for a few minutes to see if the regression tests passed.
I would like to replace pickle with an in-memory data structure.
I thought of starting a python program in one process, and connecting to it from another; but I am afraid the inter-process communication overhead will be huge.
Perhaps I could run a python function from the interpreter to load the structure in memory. Then as I modify the rest of the program, I can run it many times (without exiting the interpreter in between). This seems like it would work, but I'm not sure if I will suffer any overhead or other problems.
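Roughly, the interpreter workflow I have in mind looks like this (mytests and the pickle filename are just placeholders; on Python 3.1 the reload function lives in imp rather than importlib):

import pickle
import importlib  # on Python 3.1: use imp.reload instead of importlib.reload

# Pay the expensive load once per interpreter session.
with open('huge_structure.pickle', 'rb') as f:
    big_data = pickle.load(f)

import mytests            # placeholder module holding the regression tests
mytests.run(big_data)     # first run

# ...edit mytests.py, then re-run without reloading the data:
importlib.reload(mytests)
mytests.run(big_data)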
You can use mmap to open a view on the same file in multiple processes, with access at almost the speed of memory once the file is loaded.
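For illustration, a minimal sketch of the mmap idea (the filename matches the generator script below; how you locate objects inside the mapped bytes is up to you):

import mmap

# Map the file read-only; every process that maps it shares the OS page cache,
# so pages one process has already touched are served from memory for the others.
with open('lotsa_objects.pickled', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_bytes = mm[:16]   # slicing copies only the requested bytes
    mm.close()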
First, you can pickle the different parts of the whole object using this method:
# gen_objects.py
import random
import pickle

class BigBadObject(object):
    def __init__(self):
        self.a_dictionary = {}
        for x in range(random.randint(1, 1000)):
            self.a_dictionary[random.randint(1, 98675676)] = random.random()
        self.a_list = []
        for x in range(random.randint(1000, 10000)):
            self.a_list.append(random.random())
        self.a_string = ''.join([chr(random.randint(65, 90))
                                 for x in range(random.randint(100, 10000))])

if __name__ == "__main__":
    output = open('lotsa_objects.pickled', 'wb')
    for i in range(10000):
        pickle.dump(BigBadObject(), output, pickle.HIGHEST_PROTOCOL)
    output.close()
Once you have generated the big file as a series of separately pickled objects, you can read it back with a reader running concurrently with the code that consumes the objects (or several readers, each handling different parts).
# reader.py
from threading import Thread
from queue import Queue, Empty
from functools import reduce
import pickle
import operator

from gen_objects import BigBadObject

class Reader(Thread):
    def __init__(self, filename, q):
        Thread.__init__(self)
        self._file = open(filename, 'rb')
        self._queue = q

    def run(self):
        # Keep unpickling objects and feeding them into the queue until EOF.
        while True:
            try:
                one_object = pickle.load(self._file)
            except EOFError:
                break
            self._queue.put(one_object)
        self._file.close()

class uncached(object):
    def __init__(self, filename, queue_size=100):
        self._my_queue = Queue(maxsize=queue_size)
        self._my_reader = Reader(filename, self._my_queue)
        self._my_reader.start()

    def __iter__(self):
        while True:
            # Stop only when the reader is done AND the queue has been drained.
            if not self._my_reader.is_alive() and self._my_queue.empty():
                break
            # Loop until we get something or the thread is done processing.
            try:
                print("Getting from the queue. Queue size=", self._my_queue.qsize())
                o = self._my_queue.get(True, timeout=0.1)  # Block for 0.1 seconds
                yield o
            except Empty:
                pass
        return

# Compute an average of all the numbers in a_list, just for show.
list_avg = 0.0
list_count = 0
for x in uncached('lotsa_objects.pickled'):
    list_avg += reduce(operator.add, x.a_list)
    list_count += len(x.a_list)
print("Average: ", list_avg / list_count)
Reading the pickle file this way takes only a fraction of the wall-clock time of the plain approach, because the background reader thread keeps the queue filled while the main thread is busy processing the previous objects, so unpickling overlaps with the rest of the work.
I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp

jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')
materials = [None] * len(jsons)

def asyncJSONs(file, index):
    try:
        with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = file.split('.')[0]
        materials[index] = properties
    except:
        print("Error parsing at {}".format(file))

process_list = []
i = 0
for file in tqdm(jsons):
    p = mp.Process(target=asyncJSONs, args=(file, i))
    p.start()
    process_list.append(p)
    i += 1

for process in process_list:
    process.join()
Everything in there relating to multiprocessing was cobbled together from a collection of Google searches and articles, so I wouldn't be surprised if it isn't remotely correct. For example, the 'i' variable is a dirty attempt to keep the information in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code nothing is stored in materials.
As explained in other answers, processes don't share memory, so you can't set a value directly in materials from a child process. The function has to return its result to the main process, which has to wait for the result and collect it.
It is simpler with Pool: you don't need to manage a queue manually, it returns results in the same order as the input in all_jsons, and you can set how many processes run at the same time so it won't hog the CPU for other processes on the system.
But it can't use tqdm.
I couldn't test it, but it could be something like this:
import os
import json
from multiprocessing import Pool

# --- functions ---

def asyncJSONs(filename):
    try:
        fullpath = os.path.join(folder, filename)
        with open(fullpath) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = filename.split('.')[0]
        return properties
    except:
        print("Error parsing at {}".format(filename))

# --- main ---

# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'

if __name__ == '__main__':
    # code only for main process
    all_jsons = os.listdir(folder)

    with Pool(5) as p:
        materials = p.map(asyncJSONs, all_jsons)

    for item in materials:
        print(item)
BTW:
Other modules worth a look: concurrent.futures, joblib, ray.
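For example, a minimal sketch of the same thing with concurrent.futures, reusing the asyncJSONs function from the Pool example above:

import os
from concurrent.futures import ProcessPoolExecutor

folder = '/content/drive/My Drive/mrp_workflow/JSONs'

if __name__ == '__main__':
    all_jsons = os.listdir(folder)
    with ProcessPoolExecutor(max_workers=5) as ex:
        # like Pool.map, executor.map returns results in input order
        materials = list(ex.map(asyncJSONs, all_jsons))
    for item in materials:
        print(item)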
Going to mention a totally different way of solving this problem. Don't bother trying to append all the data to the same list. Extract the data you need and append it to a target file in ndjson/jsonlines format. That's where, instead of objects inside a JSON array [{},{}...], you have one separate object per line:
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
spawn N workers with a manifest of files to process and the output file to write to. You don't even need multiprocessing; you could use a tool like rush to parallelize.
each worker parses its data and generates the output dict
each worker opens the output file with the append flag, dumps the data, and flushes immediately:
with open(out_file, 'a') as fp:
    print(json.dumps(data), file=fp, flush=True)
Flushing ensures that, as long as each write is smaller than the kernel's buffer size (usually several MB), your different processes won't stomp on each other and corrupt each other's writes. If they do conflict, you may need to write to a separate output file for each worker and then join them all.
You can join the files and/or convert them to a regular JSON array if needed using jq. To be honest, just embrace jsonlines. It's a much better data format for long lists of objects, since you don't have to parse the whole thing into memory.
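As a quick illustration of that last point, a jsonlines file can be consumed lazily, one record at a time (out.ndjson is a placeholder name):

import json

# One object per line; nothing beyond the current record is held in memory.
with open('out.ndjson') as fp:
    for line in fp:
        record = json.loads(line)
        # ...use record...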
You need to understand how multiprocessing works. It starts a brand new process for EACH task, each with a brand new Python interpreter, which runs your script all over again. These processes do not share memory in any way. The other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can use a multiprocessing.Queue. Have the function stuff the results into the queue, while your main code waits for stuff to magically appear in the queue.
Also PLEASE read the instructions in the multiprocessing docs about main. Each new process will re-execute all the code in your main file. Thus, any one-time stuff absolutely must be contained in a
if __name__ == "__main__":
block. This is one case where the practice of putting your mainline code into a function called main() is a "best practice".
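A minimal sketch of that pattern (the worker function and file list below are illustrative, not the asker's code):

import multiprocessing as mp

def worker(filename, q):
    # do the per-file work here, then send the result back through the queue
    q.put((filename, len(filename)))

def main():
    q = mp.Queue()
    files = ['a.json', 'b.json']           # placeholder file list
    procs = [mp.Process(target=worker, args=(f, q)) for f in files]
    for p in procs:
        p.start()
    results = [q.get() for _ in procs]     # blocks until every result arrives
    for p in procs:
        p.join()
    print(results)

if __name__ == "__main__":
    main()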
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.
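If the bottleneck does turn out to be reading the files rather than CPU work, a hedged sketch of the multithreaded variant (load_one and max_workers=8 are illustrative choices, and the asker's process_dict step is left out):

import json
import os
from concurrent.futures import ThreadPoolExecutor

folder = '/content/drive/My Drive/mrp_workflow/JSONs'

def load_one(filename):
    with open(os.path.join(folder, filename)) as f:
        return json.loads(f.read())

if __name__ == '__main__':
    # threads share memory, so results can simply be collected in input order
    with ThreadPoolExecutor(max_workers=8) as ex:
        raw_data = list(ex.map(load_one, os.listdir(folder)))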
I want to read and process a file by using multiprocessing with low memory consumption, high throughput (sentence/s), and - especially important - ordered results.
I was wondering whether we can use linecache's getline for this purpose. The following code reads a file, hopefully in parallel, and executes some function on the lines that are gathered in the subprocess. Here I opted for running some tokenisation on the files with spaCy.
import datetime
from multiprocessing import Pool, current_process
from os import cpu_count
from pathlib import Path
from functools import partial
from linecache import getline

import spacy

class Processor:
    def __init__(self, spacy_model='en_core_web_sm', batch_size=2048):
        self.nlp = spacy.load(spacy_model, disable=['ner', 'textcat'])
        self.batch_size = batch_size

    @staticmethod
    def get_n_lines(pfin):
        with pfin.open(encoding='utf-8') as fhin:
            for line_idx, _ in enumerate(fhin, 1):
                pass
        return line_idx

    def process_file(self, fin):
        pfin = Path(fin).resolve()
        total_lines = self.get_n_lines(pfin)
        start_time = datetime.datetime.now()
        procfunc = partial(self.process_batch, pfin)
        with Pool(cpu_count() - 1) as pool:
            # map the starting indices to the processes
            for _ in pool.imap(procfunc, range(0, total_lines + 1, self.batch_size)):
                pass
        print('done', (datetime.datetime.now() - start_time).total_seconds())

    def process_batch(self, pfin, start):
        lines = [getline(str(pfin), i) for i in range(start, start + self.batch_size)]
        # Parse text with spaCy
        docs = list(self.nlp.pipe(lines))
        # Chop into sentences
        spacy_sents = [str(sent) for doc in docs for sent in doc.sents]
        return str(current_process()), spacy_sents

if __name__ == '__main__':
    fn = r'data/train.tok.low.en'
    proc = Processor()
    proc.process_file(fn)
I found that on my work laptop, running with 3 active cores on a file of 140K sentences, the duration is 261 seconds. When running with a single core (n_workers=1), the processing time is 431 seconds. I am not sure how to interpret this difference, but I guess it comes down to the question: does linecache.getline allow for concurrent reading? Parallel execution is faster, but considering getline expects a file name (rather than a file object), I expect it to have to open the file every time and as such block access for other processes. Is this assumption correct, given that parallel execution still seems much faster? Is there a better way to read files fast and in parallel whilst also keeping the results ordered?
You don't need linecache, and it doesn't help.
First, you don't need any special tricks to read the same file simultaneously from multiple processes. You can just do it. It'll work.
Second, linecache loads a whole file immediately as soon as a single line is requested from that file. You're not splitting the work of reading the file at all. You're doing more I/O than if you just had the parent process read the file and let the workers inherit the data. If you're getting any speedup from multiprocessing, it's probably due to parallelizing the NLP work, not the file reading.
Third, linecache is designed to support the traceback module, and it does a lot of stuff that doesn't make sense for a general-purpose file reading module, including searching the import path for a file if it doesn't find the file in the current directory.
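To illustrate that suggestion, here is a hedged sketch in which the parent process reads the file in batches and Pool.imap keeps the results in submission order; tokenize_batch is a stand-in for the spaCy work:

from itertools import islice
from multiprocessing import Pool

def tokenize_batch(lines):
    # stand-in for nlp.pipe(...); real code would run spaCy here
    return [line.strip() for line in lines]

def batches(path, batch_size=2048):
    # the parent does all the reading; workers only receive the text
    with open(path, encoding='utf-8') as fh:
        while True:
            batch = list(islice(fh, batch_size))
            if not batch:
                return
            yield batch

if __name__ == '__main__':
    with Pool() as pool:
        # imap yields results in the order the batches were submitted
        for sents in pool.imap(tokenize_batch, batches('data/train.tok.low.en')):
            pass  # consume ordered results here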
Python 3
I would like to know what a really clean, pythonic concurrent data loader should look like. I need this approach for a project of mine that does heavy computations on data that is too big to entirely fit into memory. Hence, I implemented data loaders that should run concurrently and store data in a queue, so that the main process can work while (in the mean time) the next data is being loaded & prepared. Of course, the queue should block when it is empty (main process trying to consume more items -> queue should wait for new data) or full (worker process should wait until main process consumes data out of the queue to prevent out-of-memory errors).
I have written a class to fulfill this need using Python's multiprocessing module (multiprocessing.Queue and multiprocessing.Process). The crucial parts of the class are implemented as follows:
import multiprocessing as mp
from itertools import cycle

import numpy as np

class ConcurrentLoader:
    def __init__(self, path_to_data, queue_size, batch_size):
        self._batch_size = batch_size
        self._path = path_to_data
        filenames = ...  # filenames for path 'path_to_data',
                         # get loaded using glob
        self._files = cycle(filenames)
        self._q = mp.Queue(queue_size)
        ...
        self._worker = mp.Process(target=self._worker_func, daemon=True)
        self._worker.start()  # only started, never stopped

    def _worker_func(self):
        while True:
            buffer = list()
            for i in range(self._batch_size):
                f = next(self._files)
                ...  # load f and do some pre-processing with NumPy
                ...  # add it to buffer
            self._q.put(np.array(buffer).astype(np.float32))

    def get_batch_data(self):
        return self._q.get()
The class has some more methods, but they are all for "convenience functionality". For example, it counts in a dict how often each file was loaded, how often the whole data set was loaded and so on, but these are rather easy to implement in Python and do not waste much computation time (sets, dicts, ...).
The data part itself on the other hand, due to I/O and pre-processing, can even take seconds. That is the reason why I want this to happen concurrently.
ConcurrentLoader should:
block main process: if get_batch_data is called, but queue is empty
block worker process: if queue is full, to prevent out-of-memory errors and prevent while True from wasting resources
be "transparent" to any class that uses ConcurrentLoader: they should just supply the path to the data and use get_batch_data without noticing that this actually works concurrently ("hassle free usage")
terminate its worker when main process dies to free resources again
Considering these goals (have I forgotten anything?), what should I do to enhance the current implementation? Is it thread/deadlock safe? Is there a more "pythonic" way to implement it? Can I make it cleaner? Does it waste resources somehow?
Any class that uses ConcurrentLoader would roughly follow this setup:
class Foo:
    ...

    def do_something(self):
        ...
        data1 = ConcurrentLoader("path/to/data1", 64, 8)
        data2 = ConcurrentLoader("path/to/data2", 256, 16)
        ...
        sample1 = data1.get_batch_data()
        sample2 = data2.get_batch_data()
        ...  # heavy computations with data contained in 'sample1' & 'sample2'
             # go *here*
Please either point out mistakes of any kind in order to improve my approach, or supply your own, cleaner, more pythonic approach.
Blocking when a multiprocessing.Queue is empty/full and get()/put() is called on it happens automatically.
This behavior is transparent to calling functions.
Use self._worker.daemon = True before self._worker.start() so the worker(s) will automatically be killed when the main process exits.
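A minimal sketch of those two points together (the producer below is illustrative, not the asker's loader):

import multiprocessing as mp
import time

def producer(q):
    for i in range(100):
        q.put(i)          # blocks automatically once the queue holds 4 items

if __name__ == '__main__':
    q = mp.Queue(maxsize=4)
    worker = mp.Process(target=producer, args=(q,))
    worker.daemon = True  # killed automatically when the main process exits
    worker.start()
    for _ in range(10):
        print(q.get())    # blocks automatically while the queue is empty
        time.sleep(0.1)
    # no explicit shutdown needed; the daemon worker dies with the main process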
I'm trying to read a CSV file and compare the time required for linear processing, threads, and multiprocessing, but my code doesn't seem to give me the correct output. It would be great if someone could help me with the code.
Multiprocessing and linear processing:
import csv
import time
import multiprocessing
from multiprocessing import Process

proces = []
number_of_processes = 2

class mulprocess():
    def data(self):
        with open('test.csv', 'r+') as f:
            reader = csv.reader(f)
            for row in reader:
                print row

    def processdata(self):
        for i in range(2):
            start_time = time.time()
            proces.append(multiprocessing.Process(target=self.data(), args=()))
        for p in proces:
            p.start()
        for p in proces:
            p.join()
        end_time = time.time()
        print end_time - start_time

a = mulprocess()
a.data()
a.processdata()
Threading:
import csv
import time
import threading

thread_count = 2
threads = []

class Operation():
    def data(self):
        with open('test.csv', 'r+') as f:
            reader = csv.reader(f)
            for row in reader:
                print row

    def filedata(self):
        for i in range(thread_count):
            threads.append(threading.Thread(target=self.data, args=()))
        start_time = time.time()
        for t in threads:
            t.start()
            t.join()
        end_time = time.time()
        print end_time - start_time

a = Operation()
a.filedata()
Ignoring the pointlessness of loading the same file over and over in a different process (which, btw, might cause concurrency errors because you're opening the file in r+ mode, effectively locking it) and the generally ill-advised approach of calling instance methods as separate processes, the crux of your problem lies in the line:
proces.append(multiprocessing.Process(target=self.data(), args=()))
What you're telling the multiprocessing.Process instance there is to set the target to whatever your self.data() method returns, which is None, so your process doesn't get anything to work with or call. Everything executes in your main process and, of course, it takes double the time. Check this answer to see how to set up a multiprocessing benchmark properly.
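For illustration, the fix is simply to hand Process the bound method itself rather than its return value:

# pass the callable; do not call it here
proces.append(multiprocessing.Process(target=self.data, args=()))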
As for threading - no reason to try it out, it will be slower for processing than a single-threaded approach. The only advantage you can have with threading is semi-parallelizing I/O operations.
I am looking to use multiprocessing or threading in my application to do some time-consuming operations in the background. I have looked at many examples, but I still have been unable to achieve what I want. I am trying to load a bunch of images, each of which takes several seconds. I would like the first image to be loaded and then have the others loading in the background and being stored in a list (to use later) while the program is still doing other things (like allowing controls on my GUI to still work). If I have something like the example below, how can I do this? And should I use multiprocessing or threading?
class myClass():
    def __init__(self, arg1, arg2):
        # initializes some parameters

    def aFunction(self):
        # does some things
        # creates multiple processes or threads that each call interestingFunc
        # continues doing things

    def interestingFunc(self):
        # performs operations

m = myClass()
You can use either approach. Have your Process or Thread perform its work and then put the results onto a Queue. Your main thread/process can then, at its leisure, take the results off the queue and do something with them. Here's an example with multiprocessing.
from multiprocessing import Process, Queue

def load_image(img_file, output_q):
    with open(img_file, 'rb') as f:
        img_data = f.read()
        # perform processing on img_data, then queue results
        output_q.put((img_file, img_data))

result_q = Queue()
images = ['/tmp/p1.png', '/tmp/p2.jpg', '/tmp/p3.gif', '/tmp/p4.jpg']
for img in images:
    Process(target=load_image, args=(img, result_q)).start()

for i in range(len(images)):
    img, data = result_q.get()
    # do something with the image data
    print "processing of image file %s complete" % img
This assumes that the order of processing is not significant to your application, i.e. the image data from each file might arrive on the queue in no particular order.
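If the order ever does matter, one option (a sketch, not part of the original answer) is to put each result back at its position in the images list:

results = [None] * len(images)
for _ in images:
    img, data = result_q.get()
    results[images.index(img)] = data  # slot each result back into input order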
Here's the simplest possible way to do multiple things in parallel; it'll help get you started:
source
import multiprocessing

def calc(num):
    return num * 2

pool = multiprocessing.Pool(5)
for output in pool.map(calc, [1, 2, 3]):
    print 'output:', output
output
output: 2
output: 4
output: 6
You could try something like this:
from thread import start_new_thread
import os

def loadPicture(pic):
    pass  # do some stuff with the pictures

pictureList = [f for f in os.listdir(r"C:\your\picture\folder")]

for pic in pictureList:
    start_new_thread(loadPicture, (pic,))
This is a quite simple approach; the thread returns immediately, and perhaps you'll need to use allocate_lock. If you need more capabilities, you might consider using the threading module. Be careful to pass a tuple as the 2nd argument to the thread.
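If you do move to the threading module, a hedged sketch of the equivalent (loadPicture and the folder path are the same placeholders as above):

import os
import threading

def loadPicture(pic):
    pass  # do some stuff with the pictures

folder = r"C:\your\picture\folder"
threads = [threading.Thread(target=loadPicture, args=(pic,))
           for pic in os.listdir(folder)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for all loads to finish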