A few days back I answered a question on SO regarding reading a tar file in parallel.
This was the gist of the question:
import bz2
import tarfile
from multiprocessing import Pool

tr = tarfile.open('data.tar')

def clean_file(tar_file_entry):
    if '.bz2' not in str(tar_file_entry):
        return
    with tr.extractfile(tar_file_entry) as bz2_file:
        with bz2.open(bz2_file, "rt") as bzinput:
            # Reading bz2 file
            ....
            ....

def process_serial():
    members = tr.getmembers()
    processed_files = []
    for i, member in enumerate(members):
        processed_files.append(clean_file(member))
        print(f'done {i}/{len(members)}')

def process_parallel():
    members = tr.getmembers()
    with Pool() as pool:
        processed_files = pool.map(clean_file, members)
        print(processed_files)

def main():
    process_serial()  # No error
    process_parallel()  # Error

if __name__ == '__main__':
    main()
We were able to make the error disappear by just opening the tar file inside the child process rather than in the parent, as mentioned in the answer.
I am not able to understand why this worked.
Even if we open the tarfile in the parent process, the child process will get a new copy.
So why does opening the tarfile in the child process explicitly make any difference?
Does this mean that in the first case, the child processes were somehow mutating the common tarfile object and causing memory corruption due to concurrent writes?
FWIW, the answer in the comments wrt open is actually incorrect on UNIX-like systems regarding file handle numbers.
If multiprocessing uses fork() (which it does under Linux and similar, although I read there was an issue with forking on macOS), the file handles and everything else are happily copied to child processes (by "happily" I mean it's complicated in many edge cases such as forking threads, but still it works fine for file handles).
The following works fine for me:
import multiprocessing

this = open(__file__, 'r')

def read_file():
    print(len(this.read()))

def main():
    process = multiprocessing.Process(target=read_file)
    process.start()
    process.join()

if __name__ == '__main__':
    main()
The problem is likely that tarfile keeps internal state and/or buffers while reading, and you can also simply run into conflicts by seeking to and reading different parts of the same archive simultaneously. I.e., I'm speculating that a thread pool without any synchronization would likely run into exactly the same issues in this case.
Edit: to clarify, extracting a file from a Tar archive is likely (I haven't checked the exact details) done as follows: (1) seek to the offset of the encapsulated part (file), (2) read a chunk of the encapsulated file, write the chunk to the destination file (or pipe, or w/e), (3) repeat (2) until the whole file is extracted.
Attempting to do this in a non-synchronized way from parallel processes using the same file handle will likely result in these steps getting interleaved, i.e. starting to process file #2 will seek away from file #1 while we are still in the middle of reading file #1, and so on.
Edit2, answering the comment below: the memory representation is indeed forked afresh for child processes, but resources managed on the kernel side are shared: the forked file descriptors still refer to the same open file description, and therefore to the same file offset and kernel buffers.
To illustrate:
import multiprocessing

this = open(__file__, 'rb')

def read_file(worker):
    print(worker, this.read(80))

def main():
    processes = []
    for number in (1, 2):
        processes.append(
            multiprocessing.Process(target=read_file, args=(number,)))
    for process in processes:
        process.start()
    for process in processes:
        process.join()

if __name__ == '__main__':
    main()
Running this on Linux I get:
$ python3.8 test.py
1 b"import multiprocessing\n\nthis = open(__file__, 'rb')\n\n\ndef read_file(worker):\n "
2 b''
If seeking and reading were independent, both processes would print an identical result, but they don't. Since this is a small file, and Python opts to buffer a small amount of data (8 KiB), the first process reads to the EOF, and the second process has no data left to read (unless it of course seeks back).
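For completeness, here is a minimal sketch of the kind of fix the question refers to: each worker opens its own handle on the archive, so no file offset is shared between processes. The initializer-based reopen and the trivial body of clean_file below are my assumptions for illustration, not the exact code from the accepted answer.
import tarfile
from multiprocessing import Pool

tar = None  # per-process handle, set in the worker initializer

def init_worker(tar_path):
    # Each worker process opens its own tarfile, so seeks don't interfere.
    global tar
    tar = tarfile.open(tar_path)

def clean_file(member_name):
    member = tar.getmember(member_name)
    with tar.extractfile(member) as fh:
        return len(fh.read())  # placeholder for the real bz2 processing

def process_parallel(tar_path):
    # The parent only reads the member list; extraction happens in the workers.
    with tarfile.open(tar_path) as index:
        names = [m.name for m in index.getmembers() if m.name.endswith('.bz2')]
    with Pool(initializer=init_worker, initargs=(tar_path,)) as pool:
        return pool.map(clean_file, names)

if __name__ == '__main__':
    print(process_parallel('data.tar'))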
I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp

jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')

materials = [None] * len(jsons)

def asyncJSONs(file, index):
    try:
        with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = file.split('.')[0]
        materials[index] = properties
    except:
        print("Error parsing at {}".format(file))

process_list = []
i = 0
for file in tqdm(jsons):
    p = mp.Process(target=asyncJSONs, args=(file, i))
    p.start()
    process_list.append(p)
    i += 1

for process in process_list:
    process.join()
Everything in there relating to multiprocessing was cobbled together from a collection of Google searches and articles, so I wouldn't be surprised if it isn't even remotely correct. For example, the 'i' variable is a dirty attempt to keep the information in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code nothing is stored in materials.
As you can read in other answers, processes don't share memory, so you can't set a value directly in materials. The function has to use return to send the result back to the main process, which has to wait for the results and collect them.
It can be simpler with Pool: it doesn't need to use a queue manually, it returns results in the same order as the data in all_jsons, and you can set how many processes run at the same time so they don't starve other processes on the system of CPU.
But it can't use tqdm.
I couldn't test it, but it could be something like this:
import os
import json
from multiprocessing import Pool

# --- functions ---

def asyncJSONs(filename):
    try:
        fullpath = os.path.join(folder, filename)
        with open(fullpath) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = filename.split('.')[0]
        return properties
    except:
        print("Error parsing at {}".format(filename))

# --- main ---

# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'

if __name__ == '__main__':
    # code only for main process
    all_jsons = os.listdir(folder)

    with Pool(5) as p:
        materials = p.map(asyncJSONs, all_jsons)

    for item in materials:
        print(item)
BTW:
Other modules: concurrent.futures, joblib, ray.
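For instance, a rough (untested) sketch of the same job with concurrent.futures; process_dict and the folder path are taken from the question, everything else is an assumption:
import os
import json
from concurrent.futures import ProcessPoolExecutor

folder = '/content/drive/My Drive/mrp_workflow/JSONs'

def load_one(filename):
    # Same work as asyncJSONs, but returning the result instead of writing to a global.
    with open(os.path.join(folder, filename)) as f:
        data = json.load(f)
    properties = process_dict(data, {})  # process_dict comes from the question's code
    properties['name'] = filename.split('.')[0]
    return properties

if __name__ == '__main__':
    all_jsons = os.listdir(folder)
    with ProcessPoolExecutor(max_workers=5) as executor:
        # executor.map keeps results in the same order as all_jsons
        materials = list(executor.map(load_one, all_jsons))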
Going to mention a totally different way of solving this problem. Don't bother trying to append all the data to the same list. Extract the data you need, and append it to a target file in ndjson/jsonlines format. That's a format where, instead of objects being part of a JSON array [{},{}...], you have one separate object per line.
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
spawn N workers with a manifest of files to process and the output file to write to. You don't even need MP, you could use a tool like rush to parallelize.
worker parses data, generates the output dict
worker opens the output file with append flag. dump the data and flush immediately:
with open(out_file, 'a') as fp:
print(json.dumps(data), file=fp, flush=True)
Flushing ensures that, as long as your data is smaller than the buffer size on your kernel (usually several MB), your different processes won't stomp on each other and garble the writes. If they do conflict, you may need to write to a separate output file for each worker and then join them all.
You can join the files and/or convert to regular JSON array if needed using jq. To be honest, just embrace jsonlines. It's a way better data format for long lists of objects, since you don't have to parse the whole thing in memory.
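A rough sketch of that workflow, assuming each worker gets its own chunk of the manifest and its own .jsonl output file; the fields written per file are made up for illustration:
import json
import os
from multiprocessing import Pool

def worker(args):
    filenames, out_file = args
    # Each worker owns one output file, so writes can't stomp on each other.
    with open(out_file, 'a') as fp:
        for name in filenames:
            with open(name) as f:
                data = json.load(f)
            record = {'name': os.path.basename(name).split('.')[0], 'keys': len(data)}
            print(json.dumps(record), file=fp, flush=True)

if __name__ == '__main__':
    files = ['a.json', 'b.json', 'c.json', 'd.json']  # hypothetical manifest
    chunks = [(files[i::4], 'out_{}.jsonl'.format(i)) for i in range(4)]
    with Pool(4) as pool:
        pool.map(worker, chunks)
    # afterwards: cat out_*.jsonl > all.jsonl, or merge/convert with jq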
You need to understand how multiprocessing works. It starts a brand new process for EACH task, each with a brand new Python interpreter, which runs your script all over again. These processes do not share memory in any way. The other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can use a multiprocessing.Queue. Have the function put the results in the queue, while your main code waits for stuff to magically appear in the queue.
Also PLEASE read the instructions in the multiprocessing docs about main. Each new process will re-execute all the code in your main file. Thus, any one-time stuff absolutely must be contained in a
if __name__ == "__main__":
block. This is one case where the practice of putting your mainline code into a function called main() is a "best practice".
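Putting the Queue suggestion and the __main__ guard together, a minimal sketch could look like this; parse_one is a hypothetical stand-in for the question's per-file work:
from multiprocessing import Process, Queue

def parse_one(filename, queue):
    # do the real per-file work here, then send the result back
    queue.put((filename, {'parsed': True}))

def main():
    queue = Queue()
    files = ['a.json', 'b.json']  # hypothetical file list
    workers = [Process(target=parse_one, args=(f, queue)) for f in files]
    for w in workers:
        w.start()
    # collect one result per worker before joining, so the queue never fills up
    results = dict(queue.get() for _ in workers)
    for w in workers:
        w.join()
    print(results)

if __name__ == '__main__':
    main()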
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.
I'm working on a python 2.7 program that performs these actions in parallel using multiprocessing:
reads a line from file 1 and file 2 at the same time
applies function(line_1, line_2)
writes the function output to a file
I am new to multiprocessing and I'm not particularly experienced with Python in general. Therefore, I read a lot of already asked questions and tutorials: I feel close to the point but I am now probably missing something that I can't really spot.
The code is structured like this:
import multiprocessing as mp
from itertools import izip
from multiprocessing import Queue, Process, Lock

nthreads = int(mp.cpu_count())
outq = Queue(nthreads)
l = Lock()

def func(record_1, record_2):
    result = # do stuff
    outq.put(result)

OUT = open("outputfile.txt", "w")
IN1 = open("infile_1.txt", "r")
IN2 = open("infile_2.txt", "r")

processes = []
for record_1, record_2 in izip(IN1, IN2):
    proc = Process(target=func, args=(record_1, record_2))
    processes.append(proc)
    proc.start()

for proc in processes:
    proc.join()

while (not outq.empty()):
    l.acquire()
    item = outq.get()
    OUT.write(item)
    l.release()

OUT.close()
IN1.close()
IN2.close()
To my understanding (so far) of multiprocessing as package, what I'm doing is:
creating a queue for the results of the function that has a size limit compatible with the number of cores of the machine.
filling this queue with the results of func().
reading the queue items until the queue is empty, writing them to the output file.
Now, my problem is that when I run this script it immediately becomes a zombie process. I know that the function works because without the multiprocessing implementation I had the results I wanted.
I'd like to read from the two files and write to output at the same time, to avoid generating a huge list from my input files and then reading it (input files are huge). Do you see anything gross, completely wrong or improvable?
The biggest issue I see is that you should pass the queue object to the process instead of trying to use it as a global in your function.
def func(record_1, record_2, queue):
    result = # do stuff
    queue.put(result)

for record_1, record_2 in izip(IN1, IN2):
    proc = Process(target=func, args=(record_1, record_2, outq))
Also, as currently written, you would still be pulling all that information into memory (i.e. the queue) and waiting for the reading to finish before writing to the output file. You need to move the proc.join loop to after reading through the queue, and instead of putting all the information in the queue at the end of func, it should be filling the queue with chunks in a loop over time, or else it's the same as just reading it all into memory.
You also don't need a lock unless you are using it in the worker function func, and if you do, you will again want to pass it through.
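Putting those points together, a sketch of the reworked structure might look like this (Python 2, keeping the question's one-process-per-record layout for illustration; the body of func is still a placeholder):
from itertools import izip
from multiprocessing import Process, Queue

def func(record_1, record_2, queue):
    result = record_1.strip() + '\t' + record_2.strip() + '\n'  # placeholder for the real work
    queue.put(result)

if __name__ == '__main__':
    outq = Queue()
    processes = []
    with open("infile_1.txt") as IN1, open("infile_2.txt") as IN2:
        for record_1, record_2 in izip(IN1, IN2):
            proc = Process(target=func, args=(record_1, record_2, outq))
            processes.append(proc)
            proc.start()

    with open("outputfile.txt", "w") as OUT:
        # drain the queue before joining, so workers never block on a full queue
        for _ in processes:
            OUT.write(outq.get())

    for proc in processes:
        proc.join()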
If you don't want to read / store a lot in memory, I would write out at the same time as iterating through the input files. Here is a basic example of combining each line of the files together.
with open("infile_1.txt") as infile1, open("infile_2.txt") as infile2, open("out", "w") as outfile:
for line1, line2 in zip(infile1, infile2):
outfile.write(line1 + line2)
I don't want to write too much about all of these, just trying to give you ideas. Let me know if you want more detail about something. Hope it helps!
I'm using h5py to iteratively write to a large array with python. It takes quite a long time and I can watch the file size grow as the code is running.
Unfortunately, when my Python program exits, the file content disappears. The file is not corrupt, but all values are 0.0 (the fill value I set).
I made sure the file f is closed with f.close(), and after closing the file (but before exiting the program), the file was still intact and content was there.
Is anyone familiar with this behaviour and can explain what happens there? I'd appreciate any help!
To give you a bit more information, here is what I do specifically. I created a Process that processes results from a Queue. When the process is initialised, the HDF5 file is created, and when the last item in the queue is reached, the file is closed. All of this seems to work fine (as described above) but I'm mentioning it as I don't have a lot of experience with processes and wondering if the file handling in the process class could be the problem.
from multiprocessing import Process, Queue
import h5py

class ResultProcessor(Process):

    def __init__(self, result_queue, result_file):
        Process.__init__(self)
        self.result_queue = result_queue
        self.daemon = True

        # open result file handle ('w')
        self.f = h5py.File(result_file, 'w')
        self.dset = self.f.create_dataset('zipped', (num_jobs, num_subjects), compression="gzip", fillvalue=0)

    def run(self):
        while True:
            next_result = self.result_queue.get()
            if next_result is None:
                # Poison pill means we should exit
                self.f.close()
                return
            idx, result = next_result
            self.dset[idx, :] = result
The process is then initialised and run as below:
# results_queue is still empty
result_processor = ResultProcessor(results_queue, file_name)
result_processor.start()
# now the result queue is filled
process_stuff_and_feed_to_result_queue()
# add last queue item so the end can be recognised:
result_queue.put(None)
result_processor.join()
# I checked at this point: The file content is still around!
While this won't solve why the contents of the file appear to disappear, you should keep in mind that HDF5 (and hence h5py) is not designed to have multiple programs writing to the same file (using multiprocessing usually falls under this). There is MPI support, and SWMR (single writer, multiple readers) in HDF5 1.10, but you don't have complete freedom to write anything in any order.
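For reference, a rough sketch of the SWMR mode mentioned above (requires HDF5 1.10+ and a recent h5py); here the writer and reader run one after the other just to show the calls, whereas in practice the reader would be a separate process:
import h5py
import numpy as np

# Writer: create datasets first, then enable SWMR before readers attach.
with h5py.File('results.h5', 'w', libver='latest') as f:
    dset = f.create_dataset('zipped', (100, 10), fillvalue=0)
    f.swmr_mode = True
    dset[0, :] = np.arange(10)
    dset.flush()  # make the new data visible to readers

# Reader: open with swmr=True and refresh before reading.
with h5py.File('results.h5', 'r', libver='latest', swmr=True) as f:
    dset = f['zipped']
    dset.refresh()
    print(dset[0, :])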
I'm converting a program to multiprocessing and need to be able to log to a single rotating log from the main process as well as subprocesses. I'm trying to use the 2nd example in the python cookbook Logging to a single file from multiple processes, which starts a logger_thread running as part of the main process, picking up log messages off a queue that the subprocesses add to. The example works well as is, and also works if I switch to a RotatingFileHandler.
However if I change it to start logger_thread before the subprocesses (so that I can log from the main process as well), then as soon as the log rotates, all subsequent logging generates a traceback with WindowsError: [Error 32] The process cannot access the file because it is being used by another process.
In other words I change this code from the 2nd example
workers = []
for i in range(5):
    wp = Process(target=worker_process, name='worker %d' % (i + 1), args=(q,))
    workers.append(wp)
    wp.start()
logging.config.dictConfig(d)
lp = threading.Thread(target=logger_thread, args=(q,))
lp.start()
to this:
logging.config.dictConfig(d)
lp = threading.Thread(target=logger_thread, args=(q,))
lp.start()

workers = []
for i in range(5):
    wp = Process(target=worker_process, name='worker %d' % (i + 1), args=(q,))
    workers.append(wp)
    wp.start()
and swap out logging.FileHandler for logging.handlers.RotatingFileHandler (with a very small maxBytes for testing) and then I hit this error.
I'm using Windows and Python 2.7. QueueHandler is not part of the stdlib until Python 3.2, but I've copied the source code from a Gist, which it says is safe to do.
I don't understand why starting the listener first would make any difference, nor do I understand why any process other than main would be attempting to access the file.
You should never start any threads before subprocesses. When Python forks, the threads and IPC state will not always be copied properly.
There are several resources on this, just google for fork and threads. Some people claim they can do it, but it's not clear to me that it can ever work properly.
Just start all your processes first.
Example additional information:
Status of mixing multiprocessing and threading in Python
https://stackoverflow.com/a/6079669/4279
In your case, it might be that the copied open file handle is the problem, but you still should start your subprocesses before your threads (and before you open any files that you will later want to destroy).
Some rules of thumb, summarized by fantabolous from the comments:
Subprocesses must always be started before any threads created by the same process.
multiprocessing.Pool creates both subprocesses AND threads, so one mustn't create additional Processes or Pools after the first one.
Files should not already be open at the time a Process or Pool is created. (This is OK in some cases, but not, e.g. if a file will be deleted later.)
Subprocesses can create their own threads and processes, with the same rules above applying.
Starting all processes first is the easiest way to do this
So, you can simply make your own file log handler. I have yet to see logs getting garbled from multiprocessing, so it seems file log rotation is the big issue. Just do this in your main, and you don't have to change any of the rest of your logging
import os
import logging
import logging.handlers
from multiprocessing import RLock

class MultiprocessRotatingFileHandler(logging.handlers.RotatingFileHandler):
    def __init__(self, *kargs, **kwargs):
        super(MultiprocessRotatingFileHandler, self).__init__(*kargs, **kwargs)
        self.lock = RLock()  # multiprocessing lock guarding the rollover check

    def shouldRollover(self, record):
        with self.lock:
            return super(MultiprocessRotatingFileHandler, self).shouldRollover(record)

file_log_path = os.path.join('var', 'log', os.path.basename(__file__) + '.log')
file_log = MultiprocessRotatingFileHandler(file_log_path,
                                           maxBytes=8 * 1000 * 1024,
                                           backupCount=5,
                                           delay=True)
logging.basicConfig(level=logging.DEBUG)
logging.getLogger().addHandler(file_log)
I'm willing to guess that locking every time you try to rotate is probably slowing down logging, but then this is a case where we need to sacrifice performance for correctness.
I have the following code that writes the md5sums to a logfile
for file in files_output:
    p = subprocess.Popen(['md5sum', file], stdout=logfile)
p.wait()
Will these be written in parallel? i.e. if md5sum takes a long time for one of the files, will another one be started before waiting for a previous one to complete?
If the answer to the above is yes, can I assume the order of the md5sums written to logfile may differ based upon how long md5sum takes for each file? (some files can be huge, some small)
Yes, these md5sum processes will be started in parallel.
Yes, the order of the md5sum writes will be unpredictable. And generally it is considered bad practice to share a single resource like a file between many processes this way.
Also, placing p.wait() after the for loop means it will wait only for the last of the md5sum processes to finish, and the rest of them might still be running.
But you can modify this code slightly to still have the benefits of parallel processing plus the predictability of synchronized output, if you collect the md5sum output into temporary files and gather it back into one file once all processes are done.
import subprocess
import os

processes = []
for file in files_output:
    f = os.tmpfile()
    p = subprocess.Popen(['md5sum', file], stdout=f)
    processes.append((p, f))

for p, f in processes:
    p.wait()
    f.seek(0)
    logfile.write(f.read())
    f.close()
All subprocesses run in parallel. (To avoid this, one has to wait explicitly for their completion.) They can even write into the log file at the same time, thus garbling the output. To avoid this you should let each process write into a different logfile and collect all outputs when all processes are finished.
import threading
import Queue  # Python 2; use `queue` on Python 3

q = Queue.Queue()
result = {}  # used to store the results

for fileName in fileNames:
    q.put(fileName)

def worker():
    while True:
        fileName = q.get()
        if fileName is None:  # Sentinel?
            return
        subprocess_stuff_using(fileName)
        wait_for_finishing_subprocess()
        checksum = collect_md5_result_for(fileName)
        result[fileName] = checksum  # store it

threads = [threading.Thread(target=worker) for _i in range(20)]
for thread in threads:
    thread.start()
for thread in threads:
    q.put(None)  # one Sentinel marker for each thread
for thread in threads:
    thread.join()
After this the results should be stored in result.
A simple way to collect output from parallel md5sum subprocesses is to use a thread pool and write to the file from the main process:
from multiprocessing.dummy import Pool  # use threads
from subprocess import check_output

def md5sum(filename):
    try:
        return check_output(["md5sum", filename]), None
    except Exception as e:
        return None, e

if __name__ == "__main__":
    p = Pool(number_of_processes)  # specify number of concurrent processes
    with open("md5sums.txt", "wb") as logfile:
        for output, error in p.imap(md5sum, filenames):  # provide filenames
            if error is None:
                logfile.write(output)
the output from md5sum is small so you can store it in memory
imap preserves order
number_of_processes may be different from the number of files or CPU cores (larger values don't mean faster: it depends on the relative performance of IO (disks) and CPU)
You can try to pass several files at once to the md5sum subprocesses.
You don't need external subprocess in this case; you can calculate md5 in Python:
import hashlib
from functools import partial

def md5sum(filename, chunksize=2**15, bufsize=-1):
    m = hashlib.md5()
    with open(filename, 'rb', bufsize) as f:
        for chunk in iter(partial(f.read, chunksize), b''):
            m.update(chunk)
    return m.hexdigest()
To use multiple processes instead of threads (to allow the pure Python md5sum() to run in parallel utilizing multiple CPUs) just drop .dummy from the import in the above code.
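For example, a minimal sketch of the process-based variant reusing the md5sum() above; the filenames list is a made-up placeholder:
from multiprocessing import Pool  # real processes, so the Python md5sum() can use several CPUs

if __name__ == "__main__":
    filenames = ["a.bin", "b.bin"]  # hypothetical input files
    with open("md5sums.txt", "w") as logfile, Pool() as pool:
        # imap preserves the input order, as noted above
        for name, digest in zip(filenames, pool.imap(md5sum, filenames)):
            logfile.write("{}  {}\n".format(digest, name))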