I have a logger running on a few thousand processes, and they all write to the same file in append-mode. What would be a good way to guarantee that writes are atomic -- that is, each time a process writes to a log its entire contents are written in one block and there's no other process that writes to that file at the same time?
My thought was doing something like:
import os
from logging import getLogger

logger = getLogger()
lockfile = '/tmp/loglock'

def atomic_log(msg):
    while True:
        if os.path.exists(lockfile):
            continue
        with open(lockfile, 'w') as f:
            logger.info(msg)
        os.remove(lockfile)
        return

def some_function(request):
    atomic_log("Hello")
What would be an actual way to do the above on a POSIX system?
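For reference, a common POSIX approach (a sketch only; the lock-file path and logger setup are illustrative) is to hold an advisory fcntl.flock lock around each write instead of polling for the lock file:

import fcntl
import logging

logging.basicConfig(filename='app.log', level=logging.INFO)   # illustrative log destination
logger = logging.getLogger(__name__)

LOCKFILE = '/tmp/loglock'   # illustrative lock-file path

def atomic_log(msg):
    # hold an exclusive advisory lock for the duration of the write;
    # other processes doing the same block inside flock() until it is released
    with open(LOCKFILE, 'w') as lockf:
        fcntl.flock(lockf, fcntl.LOCK_EX)
        try:
            logger.info(msg)
        finally:
            fcntl.flock(lockf, fcntl.LOCK_UN)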
I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp

jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')
materials = [None] * len(jsons)

def asyncJSONs(file, index):
    try:
        with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = file.split('.')[0]
        materials[index] = properties
    except:
        print("Error parsing at {}".format(file))

process_list = []
i = 0
for file in tqdm(jsons):
    p = mp.Process(target=asyncJSONs, args=(file, i))
    p.start()
    process_list.append(p)
    i += 1

for process in process_list:
    process.join()
Everything in that code relating to multiprocessing was cobbled together from a collection of Google searches and articles, so I wouldn't be surprised if it isn't remotely correct. For example, the 'i' variable is a dirty attempt to keep the information in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code nothing is stored in materials.
As you can read in other answers, processes don't share memory, so you can't set a value directly in materials. The function has to use return to send the result back to the main process, and the main process has to wait for the result and get it.
It can be simpler with Pool. You don't need to use a queue manually, it should return the results in the same order as the data in all_jsons, and you can set how many processes run at the same time so it won't hog the CPU for other processes on the system.
But it can't use tqdm.
I couldn't test it, but it could be something like this:
import os
import json
from multiprocessing import Pool

# --- functions ---

def asyncJSONs(filename):
    try:
        fullpath = os.path.join(folder, filename)
        with open(fullpath) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = filename.split('.')[0]
        return properties
    except:
        print("Error parsing at {}".format(filename))

# --- main ---

# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'

if __name__ == '__main__':
    # code only for main process
    all_jsons = os.listdir(folder)

    with Pool(5) as p:
        materials = p.map(asyncJSONs, all_jsons)

    for item in materials:
        print(item)
BTW: other modules worth looking at: concurrent.futures, joblib, ray.
Going to mention a totally different way of solving this problem. Don't bother trying to append all the data to the same list. Extract the data you need and append it to some target file in ndjson/jsonlines format. That's a format where, instead of objects being part of a JSON array [{},{}...], you have a separate object on each line:
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
spawn N workers with a manifest of files to process and the output file to write to. You don't even need MP, you could use a tool like rush to parallelize.
worker parses data, generates the output dict
worker opens the output file with append flag. dump the data and flush immediately:
with open(out_file, 'a') as fp:
    print(json.dumps(data), file=fp, flush=True)
Flush ensures that, as long as your data is smaller than the buffer size on your kernel (usually several MB), your different processes won't stomp on each other or interleave their writes. If they do conflict, you may need to write to a separate output file for each worker and then join them all.
You can join the files and/or convert them to a regular JSON array if needed using jq. To be honest, just embrace jsonlines; it's a much better data format for long lists of objects, since you don't have to parse the whole thing into memory.
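To make the workflow concrete, a rough sketch of one such worker (the folder path and output file name are illustrative, and the OP's process_dict step is left as a comment) could be:

import json
import os
from multiprocessing import Pool

folder = '/content/drive/My Drive/mrp_workflow/JSONs'   # illustrative input folder
out_file = 'materials.ndjson'                           # illustrative output file

def extract_and_append(filename):
    with open(os.path.join(folder, filename)) as f:
        record = json.load(f)        # the OP's process_dict(...) step would go here
    record['name'] = filename.split('.')[0]
    # append one JSON object per line and flush so the line is written in one go
    with open(out_file, 'a') as fp:
        print(json.dumps(record), file=fp, flush=True)

if __name__ == '__main__':
    with Pool(8) as pool:
        pool.map(extract_and_append, os.listdir(folder))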
You need to understand how multiprocessing works. It starts a brand new process for EACH task, each with a brand new Python interpreter, which runs your script all over again. These processes do not share memory in any way. The other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can use a multiprocessing.Queue. Have the function stuff the results into the queue, while your main code waits for stuff to magically appear in the queue.
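A minimal sketch of that pattern (the file list and the result payload are illustrative):

from multiprocessing import Process, Queue

def worker(filename, queue):
    # ... parse the file here ...
    queue.put((filename, {'ok': True}))     # illustrative result payload

if __name__ == '__main__':
    queue = Queue()
    files = ['a.json', 'b.json']            # illustrative file list
    procs = [Process(target=worker, args=(f, queue)) for f in files]
    for p in procs:
        p.start()
    results = [queue.get() for _ in procs]  # main process blocks until results appear
    for p in procs:
        p.join()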
Also, PLEASE read the instructions in the multiprocessing docs about __main__. Each new process will re-execute all the code in your main file. Thus, any one-time setup absolutely must be contained in an
if __name__ == "__main__":
block. This is one case where putting your mainline code into a function called main() really is a best practice.
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.
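If reading does turn out to be the dominant cost, a thread-pool sketch (the folder path and worker count are illustrative) could look like this:

import json
import os
from concurrent.futures import ThreadPoolExecutor

folder = '/content/drive/My Drive/mrp_workflow/JSONs'   # illustrative path

def load_json(filename):
    # threads spend most of their time blocked on disk I/O, so the GIL is less of a problem
    with open(os.path.join(folder, filename)) as f:
        return json.load(f)

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=8) as executor:
        materials = list(executor.map(load_json, os.listdir(folder)))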
A few days back I answered a question on SO regarding reading a tar file in parallel.
This was the gist of the question:
import bz2
import tarfile
from multiprocessing import Pool

tr = tarfile.open('data.tar')

def clean_file(tar_file_entry):
    if '.bz2' not in str(tar_file_entry):
        return
    with tr.extractfile(tar_file_entry) as bz2_file:
        with bz2.open(bz2_file, "rt") as bzinput:
            # Reading bz2 file
            ....
            ....

def process_serial():
    members = tr.getmembers()
    processed_files = []
    for i, member in enumerate(members):
        processed_files.append(clean_file(member))
        print(f'done {i}/{len(members)}')

def process_parallel():
    members = tr.getmembers()
    with Pool() as pool:
        processed_files = pool.map(clean_file, members)
        print(processed_files)

def main():
    process_serial()    # No error
    process_parallel()  # Error

if __name__ == '__main__':
    main()
We were able to make the error disappear by just opening the tar file inside the child process rather than in the parent, as mentioned in the answer.
I am not able to understand why this worked.
Even if we open the tarfile in the parent process, the child process will get a new copy.
So why does opening the tarfile in the child process explicitly make any difference?
Does this mean that in the first case, the child processes were somehow mutating the common tarfile object and causing memory corruption due to concurrent writes?
FWIW, the answer in the comments wrt open is actually incorrect on UNIX-like systems regarding file handle numbers.
If multiprocessing uses fork() (which it does under Linux and similar, although I read there was an issue with forking on macOS), the file handles and everything else are happily copied to child processes (by "happily" I mean it's complicated in many edge cases such as forking threads, but still it works fine for file handles).
The following works fine for me:
import multiprocessing

this = open(__file__, 'r')

def read_file():
    print(len(this.read()))

def main():
    process = multiprocessing.Process(target=read_file)
    process.start()
    process.join()

if __name__ == '__main__':
    main()
The problem is likely that tarfile has internal structure and/or buffering while reading; you can also simply run into conflicts by trying to seek and read different parts of the same archive simultaneously. I.e., I'm speculating that using a thread pool without any synchronization is likely to run into exactly the same issues in this case.
Edit: to clarify, extracting a file from a tar archive is likely (I haven't checked the exact details) done as follows: (1) seek to the offset of the encapsulated part (file), (2) read a chunk of the encapsulated file and write the chunk to the destination file (or pipe, or whatever), (3) repeat (2) until the whole file is extracted.
Attempting to do this in a non-synchronized way from parallel processes using the same file handle will likely result in mixing of these steps, i.e. starting to process file #2 will seek away from file #1 while we are in the middle of reading file #1, etc.
Edit 2, answering the comment below: the memory representation is forked afresh for child processes, that's true; but resources managed on the kernel side (such as file handles and kernel buffers) are shared.
To illustrate:
import multiprocessing

this = open(__file__, 'rb')

def read_file(worker):
    print(worker, this.read(80))

def main():
    processes = []
    for number in (1, 2):
        processes.append(
            multiprocessing.Process(target=read_file, args=(number,)))
    for process in processes:
        process.start()
    for process in processes:
        process.join()

if __name__ == '__main__':
    main()
Running this on Linux I get:
$ python3.8 test.py
1 b"import multiprocessing\n\nthis = open(__file__, 'rb')\n\n\ndef read_file(worker):\n "
2 b''
If seeking and reading were independent, both processes would print an identical result, but they don't. Since this is a small file, and Python opts to buffer a small amount of data (8 KiB), the first process reads to the EOF, and the second process has no data left to read (unless it of course seeks back).
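To tie this back to the fix mentioned in the question (opening the tar file inside the child process rather than in the parent), a sketch of that pattern using a Pool initializer might look like this; the archive name and the per-member processing are illustrative:

import tarfile
from multiprocessing import Pool

worker_tar = None   # one tarfile handle per worker process

def init_worker(tar_path):
    global worker_tar
    # opened after the worker starts, so the handle (and its offset) is not shared
    worker_tar = tarfile.open(tar_path)

def clean_member(name):
    member = worker_tar.getmember(name)
    with worker_tar.extractfile(member) as f:
        return len(f.read())    # placeholder for the real processing

if __name__ == '__main__':
    with tarfile.open('data.tar') as t:
        names = [m.name for m in t.getmembers() if m.isfile()]
    with Pool(initializer=init_worker, initargs=('data.tar',)) as pool:
        results = pool.map(clean_member, names)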
I'm using h5py to iteratively write to a large array with python. It takes quite a long time and I can watch the file size grow as the code is running.
Unfortunately, when my Python program exits, the file content disappears. The file is not corrupt, but all values are 0.0 (the fill value I set).
I made sure the file f is closed with f.close(), and after closing the file (but before exiting the program), the file was still intact and the content was there.
Is anyone familiar with this behaviour and can explain what happens there? I'd appreciate any help!
To give you a bit more information, here is what I do specifically. I created a Process that processes results from a Queue. When the process is initialised, the HDF5 file is created, and when the last item in the queue is reached, the file is closed. All of this seems to work fine (as described above), but I'm mentioning it because I don't have a lot of experience with processes and I'm wondering if the file handling in the process class could be the problem.
from multiprocessing import Process, Queue
import h5py

class ResultProcessor(Process):

    def __init__(self, result_queue, result_file):
        Process.__init__(self)
        self.result_queue = result_queue
        self.daemon = True

        # open result file handle ('w')
        self.f = h5py.File(result_file, 'w')
        self.dset = self.f.create_dataset('zipped', (num_jobs, num_subjects),
                                          compression="gzip", fillvalue=0)

    def run(self):
        while True:
            next_result = self.result_queue.get()
            if next_result is None:
                # Poison pill means we should exit
                self.f.close()
                return
            idx, result = next_result
            self.dset[idx, :] = result
The process is then initialised and run as below:
# results_queue is still empty
result_processor = ResultProcessor(results_queue, file_name)
result_processor.start()

# now the result queue is filled
process_stuff_and_feed_to_result_queue()

# add last queue item so the end can be recognised:
results_queue.put(None)

result_processor.join()

# I checked at this point: the file content is still around!
While this won't explain why the contents of the file appear to disappear, you should keep in mind that HDF5 (and hence h5py) is not designed to have multiple programs (and using multiprocessing usually falls under this) writing to the same file. There is MPI support, and SWMR (single writer, multiple readers) in 1.10, but you don't have complete freedom to write anything in any order.
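For completeness, a minimal sketch of SWMR mode in h5py (the file and dataset names are illustrative; SWMR needs HDF5 1.10+ and libver='latest'):

import h5py
import numpy as np

# writer: the only process allowed to write
with h5py.File('results.h5', 'w', libver='latest') as f:
    dset = f.create_dataset('zipped', (100, 10), fillvalue=0)
    f.swmr_mode = True            # from this point on, readers may open the file
    dset[0, :] = np.arange(10)
    dset.flush()                  # make the write visible to readers

# reader: would normally run in a separate process
with h5py.File('results.h5', 'r', swmr=True) as f:
    print(f['zipped'][0, :])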
I have two functions, A and B:
from os import remove
from os.path import exists

def A():
    if exists("temp/my_file.txt"):
        my_file = open("temp/my_file.txt", "r")
        # Do stuff
        my_file.close()
        remove("temp/my_file.txt")

def B():
    my_file = open("temp/my_file.txt", "w")
    # Do other stuff
    my_file.close()
These functions are written in separate scripts and run independently. B() creates the file which A() is supposed to read and then delete. However, I run into various problems with this (Error 2, for instance). I've tried using posixfile, which doesn't work on Windows; lockfile, which gives me import errors; and I've tried writing the file to a temporary directory while I'm using it in B() and then moving it back when I want A() to read and delete it.
Could I get some insight into what's going on and how I could fix this?
While it's quite possible for one thread to read from a file that's being written to by another thread, it's not possible for either of them to delete the file without causing an error.
For example, if the reading thread deleted the file, the write operations would fail because the file no longer exists. This sounds like a situation where you are better off using a simple message queue; the simplest is probably Redis lpush/rpop, which is a lot easier than file I/O.
If you were to do this with redis,
import redis

def A():
    rdb = redis.Redis()
    while True:
        # brpop blocks until an item is available and returns a (key, value) tuple
        _, item = rdb.brpop('somekey')
        # do stuff
and the writer becomes
import redis

def B():
    rdb = redis.Redis()
    while True:
        # do stuff
        rdb.lpush('somekey', item)
brpop (and blpop) block until data becomes available (the non-blocking rpop/lpop return immediately). If you want to stop the loop, push in some special sentinel value as a signal.
I am using mpi4py to model a distributed application.
I have n processes accessing a shared file and writing some logs into the shared file during their execution. I notice that the logs are not uniformly written. Here is an example of how logs are written into the shared file:
process0.log0
process0.log1
process0.log2
process0.log3
process0.log4
process2.log0
process2.log1
process2.log2
process1.log0
process1.log1
Ideally it should be like:
process0.log0
process1.log0
process2.log0
process0.log1
process2.log1
process1.log1
process0.log2
Can anyone tell me what could be wrong with my implementation? I am writing into the file using the pickle module.
The following is the function which dumps the log:
import pickle

log_file_name = "store.log"

def writeLog(data):
    try:
        # pickle is a binary format, so the file has to be opened in binary mode
        with open(log_file_name, "ab") as fp:
            pickle.dump(obj=data, file=fp)
    except:
        with open(log_file_name, "wb") as fp:
            pickle.dump(obj=data, file=fp)

def readLog():
    data = []
    try:
        with open(log_file_name, "rb") as fp:
            while True:
                data.append(pickle.load(fp))
        return data
    except EOFError:
        return data
All n processes call this function to dump their data.
There are lots of questions/answers out there that explain the phenomenon you're seeing here:
MPI - Printing in an order
Using MPI, a message appears to have been received before it has been sent
how do I print the log in order in MPI
Why does this MPI code execute out of order?
Redirecting stdout from children spawned via MPI_Comm_spawn
Even though these are (mostly) talking about printing to the screen, the problem is the same. MPI is a distributed model which means that some processes will execute faster than others and it will probably be a different order every time depending on the workload/ordering of each process.
If ordering is important, you can use synchronization functions to enforce it or you can use something more fancy like MPI I/O for writing to files (not my specialty so I can't tell you much more about it).
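If ordering matters, a sketch of rank-ordered writing with mpi4py (reusing the question's store.log and pickle-based format; the barrier-per-turn scheme is just one option) could be:

import pickle
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def write_log_in_rank_order(data, log_file_name="store.log"):
    # ranks take turns: rank r writes only on its turn, everyone synchronises in between
    for turn in range(size):
        if rank == turn:
            with open(log_file_name, "ab") as fp:
                pickle.dump(data, fp)
        comm.Barrier()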