I have two functions, A and B:
from os import remove
from os.path import exists

def A():
    if exists("temp/my_file.txt"):
        my_file = open("temp/my_file.txt", "r")
        # Do stuff
        my_file.close()
        remove("temp/my_file.txt")

def B():
    my_file = open("temp/my_file.txt", "w")
    # Do other stuff
    my_file.close()
These functions live in separate scripts and run independently. B() creates the file which A() is supposed to read and then delete. However, I run into various problems with this (errno 2, "No such file or directory", for instance). I've tried posixfile, which doesn't work on Windows; lockfile, which gives me import errors; and writing the file to a temporary directory while B() is using it and then moving it back when I want A() to read and delete it.
Could I get some insight into what's going on and how I could fix this?
While it's quite possible for one thread to read from a file that's being written to by another, it's not possible for either of them to delete the file without causing an error.
For example, if the reading thread deleted the file, the write operations would fail because the file no longer exists. This sounds like a situation where you are better off using a simple message queue; the simplest is probably redis lpush/rpop, which is a lot easier than file I/O.
If you were to do this with redis,
import redis

def A():
    rdb = redis.Redis()
    while True:
        # brpop blocks until an item is available and returns (key, value)
        _, item = rdb.brpop('somekey')
        # do stuff
and the writer becomes
import redis

def B():
    rdb = redis.Redis()
    while True:
        # do stuff
        rdb.lpush('somekey', item)
brpop (and blpop) block until data becomes available; the non-blocking rpop/lpop return None immediately if the list is empty. If you want to stop the loop, push in some special value as a signal.
I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp

jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')
materials = [None] * len(jsons)

def asyncJSONs(file, index):
    try:
        with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = file.split('.')[0]
        materials[index] = properties
    except:
        print("Error parsing at {}".format(file))

process_list = []
i = 0
for file in tqdm(jsons):
    p = mp.Process(target=asyncJSONs, args=(file, i))
    p.start()
    process_list.append(p)
    i += 1

for process in process_list:
    process.join()
Everything relating to multiprocessing here was cobbled together from a collection of Google searches and articles, so I wouldn't be surprised if it isn't remotely correct. For example, the i variable is a dirty attempt to keep the results in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code nothing is stored in materials.
As you can read in other answers, processes don't share memory, so you can't set a value directly in materials. The function has to return its result to the main process, which has to wait for the result and collect it.
It can be simpler with Pool: it doesn't need a manually managed queue, it returns results in the same order as the data in all_jsons, and you can set how many processes run at the same time so the pool doesn't starve the rest of the system of CPU.
But it can't use tqdm.
I couldn't test it, but it could be something like this:
import os
import json
from multiprocessing import Pool

# --- functions ---

def asyncJSONs(filename):
    try:
        fullpath = os.path.join(folder, filename)
        with open(fullpath) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = filename.split('.')[0]
        return properties
    except:
        print("Error parsing at {}".format(filename))

# --- main ---

# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'

if __name__ == '__main__':
    # code only for main process
    all_jsons = os.listdir(folder)

    with Pool(5) as p:
        materials = p.map(asyncJSONs, all_jsons)

    for item in materials:
        print(item)
BTW: other modules worth a look: concurrent.futures, joblib, ray.
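For example, a rough equivalent with concurrent.futures might look like this (untested sketch; it assumes the same folder layout and the process_dict helper from the question):

import os
import json
from concurrent.futures import ProcessPoolExecutor

folder = '/content/drive/My Drive/mrp_workflow/JSONs'

def parse_json(filename):
    try:
        with open(os.path.join(folder, filename)) as f:
            data = json.load(f)
        properties = process_dict(data, {})   # helper from the question
        properties['name'] = filename.split('.')[0]
        return properties
    except Exception as e:
        print("Error parsing at {}: {}".format(filename, e))

if __name__ == '__main__':
    all_jsons = os.listdir(folder)
    # like Pool.map, executor.map keeps results in the same order as the input
    with ProcessPoolExecutor(max_workers=5) as executor:
        materials = list(executor.map(parse_json, all_jsons))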
Going to mention a totally different way of solving this problem: don't bother trying to append all the data to the same list. Extract the data you need and append it to a target file in ndjson/jsonlines format. That is, instead of objects being part of a JSON array [{},{}...], you have separate objects on each line.
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
spawn N workers with a manifest of files to process and the output file to write to. You don't even need multiprocessing; you could use a tool like rush to parallelize.
each worker parses its data and generates the output dict
each worker opens the output file with the append flag, dumps the data, and flushes immediately:
with open(out_file, 'a') as fp:
    print(json.dumps(data), file=fp, flush=True)
Flushing ensures that, as long as each write is smaller than the buffer size on your kernel (usually several MB), your different processes won't stomp on each other's writes. If they do conflict, you may need to write to a separate output file for each worker and then join them all.
You can join the files and/or convert them to a regular JSON array if needed using jq. To be honest, just embrace jsonlines; it's a far better data format for long lists of objects, since you don't have to parse the whole thing into memory.
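For instance, reading such a file back only ever needs one line in memory at a time (sketch; 'output.ndjson' is a placeholder name, not from the original):

import json

with open('output.ndjson') as fp:
    for line in fp:
        record = json.loads(line)
        # handle one record at a time, without loading the whole file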
You need to understand how multiprocessing works. It starts a brand new process for EACH task, each with a brand new Python interpreter that runs your script all over again. These processes do not share memory in any way. The other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can use a multiprocessing.Queue. Have the function put the results into the queue, while your main code waits for results to magically appear in it.
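A minimal sketch of that queue pattern (illustrative names, not the asker's exact code):

import multiprocessing as mp

def worker(filename, queue):
    result = {'name': filename}   # stand-in for the real JSON parsing
    queue.put(result)             # send the result back to the main process

if __name__ == '__main__':
    queue = mp.Queue()
    files = ['a.json', 'b.json']
    processes = [mp.Process(target=worker, args=(f, queue)) for f in files]
    for p in processes:
        p.start()
    results = [queue.get() for _ in processes]   # blocks until each result arrives
    for p in processes:
        p.join()
    print(results)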
Also, PLEASE read the instructions in the multiprocessing docs about the main module. Each new process will re-execute all the code in your main file, so any one-time setup absolutely must be contained in an
if __name__ == "__main__":
block. This is one case where the practice of putting your mainline code into a function called main() really is a "best practice".
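In other words, something like:

def main():
    # one-time setup: list the files, create the pool, collect the results
    ...

if __name__ == "__main__":
    main()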
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.
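If reading really does dominate, a thread-based version could look roughly like this (untested sketch reusing the question's folder path; json.load itself still runs under the GIL, so this only helps if the disk reads are the slow part):

import json
import os
from concurrent.futures import ThreadPoolExecutor

folder = '/content/drive/My Drive/mrp_workflow/JSONs'

def read_json(filename):
    # threads are fine here because most of the time is spent waiting on I/O
    with open(os.path.join(folder, filename)) as f:
        return json.load(f)

all_jsons = os.listdir(folder)
with ThreadPoolExecutor(max_workers=8) as executor:
    raw_data = list(executor.map(read_json, all_jsons))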
I have a logger running on a few thousand processes, and they all write to the same file in append mode. What would be a good way to guarantee that writes are atomic, that is, that each time a process writes to the log, its entire contents are written in one block and no other process writes to that file at the same time?
My thought was doing something like:
import os
from logging import getLogger

logger = getLogger()
lockfile = '/tmp/loglock'

def atomic_log(msg):
    while True:
        if os.path.exists(lockfile):
            continue  # spin until the lock file disappears
        with open(lockfile, 'w') as f:
            logger.write(msg)
        os.remove(lockfile)
        break

def some_function(request):
    atomic_log("Hello")
What would be an actual way to do the above on a posix system?
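One common POSIX approach, sketched here as an assumption rather than a definitive answer, is to take an advisory fcntl.flock lock around each append instead of a separate lock file:

import fcntl

def atomic_append(path, msg):
    # open in append mode and hold an exclusive advisory lock for the write
    with open(path, 'a') as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.write(msg + '\n')
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)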
I am writing a Flask Web Application using WTForms. In one of the forms the user should upload a csv file and the server will analyze the received data. This is the code I am using.
filename = token_hex(8) + '.csv' # Generate a new filename
form.dataset.data.save('myapp/datasets/' + filename) # Save the received file
dataset = genfromtxt('myapp/datasets/' + filename, delimiter=',') # Open the newly generated file
# analyze 'dataset'
As long as I was using this code inside a single-threaded application, everything worked. Then I tried adding a thread to the code. Here's the procedure called by the thread (the same exact code, inside a function):
def execute_analysis(form):
    filename = token_hex(8) + '.csv' # Generate a new filename
    form.dataset.data.save('myapp/datasets/' + filename) # Save the received file
    dataset = genfromtxt('myapp/datasets/' + filename, delimiter=',') # Open the newly generated file
    # analyze 'dataset'
and here's how I call the thread
import threading

@posts.route("/estimation", methods=['GET', 'POST'])
@login_required
def estimate_parameters():
    form = EstimateForm()
    if form.validate_on_submit():
        threading.Thread(target=execute_analysis, args=[form]).start()
        flash("Your request has been received. Please check the site again in a few minutes.", category='success')
        # return render_template('posts/post.html', title=post.id, post=post)
    return render_template('estimations/estimator.html', title='New Analysis', form=form, legend='New Analysis')
But now I get the following error:
ValueError: I/O operation on closed file.
The error refers to the save function call. Why is it not working? How should I fix it?
It's hard to tell without further context, but I suspect you're returning from a function or exiting a context manager, which causes some file descriptor to close and hence causes the save(..) call to fail with ValueError.
If so, one direct fix would be to wait for the thread to finish before returning/closing the file. Something along the lines of:
def handle_request(form):
    ...
    analyzer_thread = threading.Thread(target=execute_analysis, args=[form])
    analyzer_thread.start()
    ...
    analyzer_thread.join() # wait for completion of execute_analysis
    cleanup_context(form)
    return
Here is a reproducible minimal example of the problem I am describing:
import threading

SEM = threading.Semaphore(0)

def run(fd):
    SEM.acquire() # wait till release
    fd.write("This will fail :(")

fd = open("test.txt", "w+")
other_thread = threading.Thread(target=run, args=[fd])
other_thread.start()

fd.close()
SEM.release() # release the semaphore, so other_thread will acquire & proceed
other_thread.join()
Note that the main thread closes the file, and the other thread then fails on the write call with ValueError: I/O operation on closed file., as in your case.
I don't know the framework sufficiently to tell exactly what happened, but I can tell you how you probably can fix it.
Whenever you have a resource that is shared by multiple threads, use a lock.
import threading
from threading import Lock

LOCK = Lock()

def process():
    LOCK.acquire()
    ... # open a file, write some data to it etc.
    LOCK.release()

    # alternatively, use the context manager syntax
    with LOCK:
        ...

threading.Thread(target=process).start()
threading.Thread(target=process).start()
Documentation on threading.Lock:
The class implementing primitive lock objects. Once a thread has acquired a lock, subsequent attempts to acquire it block, until it is released
Basically, after thread 1 calls LOCK.acquire(), subsequent calls from other threads will cause those threads to freeze and wait until something calls LOCK.release() (usually thread 1, after it finishes its business with the resource).
If the filenames are randomly generated, then I wouldn't expect problems with one thread closing the other's file, unless both happen to generate the same name. But perhaps you can figure it out with some experimentation, e.g. first try locking the calls to both save and genfromtxt and check whether that helps. It might also make sense to add some print statements (or, even better, use logging), e.g. to check that the file names don't collide.
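A sketch of that experiment, wrapping both calls from the question in one lock (io_lock is a name introduced here; token_hex and genfromtxt come from the question's imports):

from threading import Lock

io_lock = Lock()

def execute_analysis(form):
    filename = token_hex(8) + '.csv'   # generate a new filename, as in the question
    with io_lock:                      # serialize the save and the read across threads
        form.dataset.data.save('myapp/datasets/' + filename)
        dataset = genfromtxt('myapp/datasets/' + filename, delimiter=',')
    # analyze 'dataset'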
I have a directory of Pickled lists which I would like to load sequentially, use as part of an operation, and then discard. The files are around 0.75 - 2GB each when pickled and I can load a number in memory at any one time, although nowhere near all of them. Each pickled file represents one day of data.
Currently, the unpickling process consumes a substantial proportion of the runtime of the program. My proposed solution is to load the first file and, whilst the operation is running on this file, asynchronously load the next file in the list.
I have thought of two ways I could do this: 1) Threading and 2) Asyncio. I have tried both of these but neither has seemed to work. Below is my (attempted) implementation of a Threading-based solution.
import os
import threading
import pickle

class DataSource:
    def __init__(self, folder):
        self.folder = folder
        self.next_file = None

    def get(self):
        if self.next_file is None:
            self.load_file()
        data = self.next_file
        io_thread = threading.Thread(target=self.load_file, daemon=True)
        io_thread.start()
        return data

    def get_next_file(self):
        for filename in sorted(os.listdir(self.folder)):
            yield self.folder + filename

    def load_file(self):
        self.next_file = pickle.load(open(next(self.get_next_file()), "rb"))
The main program will call DataSource().get() to retrieve each file. The first time it is loaded, load_file() will load the file into next_file where it will be stored. Then, the thread io_thread should load each successive file into next_file to be returned via get() as needed.
The thread that is launched does appear to do some work (it consumes a vast amount of RAM, ~60GB) however it does not appear to update next_file.
Could someone suggest why this doesn't work? And, additionally, if there is a better way to achieve this result?
Thanks
DataSource().get() seems to be your first problem: it means you create a new instance of the DataSource class every time, and only ever get to load the first file, because you never call the same DataSource instance again to proceed to the next file. Maybe you mean to do something along the lines of:
datasource = DataSource()
while datasource.not_done():
    datasource.get()
It would be useful to share the full code, preferably on repl.it or somewhere it can be executed.
Also, if you want better performance, it might be worthwhile to look into the multiprocessing module, as Python blocks some operations with the global interpreter lock (GIL) so that only one thread runs at a time, even when you have multiple CPU cores. That might not be a problem in your case, though, as reading from disk is probably the bottleneck; I'd guess Python releases the lock while executing the underlying native code that reads from the filesystem.
I'm also curious how you could use asyncio for pickles. I guess you'd read the pickle into memory from the file first, and then unpickle when it's done, while doing other processing during the loading. That seems like it could work nicely.
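A rough, untested sketch of that idea: read the raw bytes in an executor so the event loop stays free, then unpickle once the bytes are in memory:

import asyncio
import pickle

def _read_bytes(path):
    with open(path, 'rb') as f:
        return f.read()

async def load_pickle(path):
    loop = asyncio.get_running_loop()
    # the blocking file read happens in a worker thread
    raw = await loop.run_in_executor(None, _read_bytes, path)
    return pickle.loads(raw)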
Finally, I'd add debug prints to see what's going on.
Update: The next problem seems to be that you are using the get_next_file generator wrongly. You create a new generator each time with self.get_next_file(), so you only ever load the first file. You should create the generator only once and then call next() on it. Maybe this helps to understand; it's also on replit:
def get_next_file():
    for filename in ['a', 'b', 'c']:
        yield filename

for n in get_next_file():
    print(n)

print("---")

print(next(get_next_file()))
print(next(get_next_file()))
print(next(get_next_file()))

print("---")

gen = get_next_file()
print(gen)
print(next(gen))
print(next(gen))
print(next(gen))
Output:
a
b
c
---
a
a
a
---
<generator object get_next_file at 0x7ff4757f6cf0>
a
b
c
https://repl.it/#ToniAlatalo/PythonYieldNext#main.py
Again, debug prints would help you see what's going on, what file you are loading when etc.
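Putting the two points together, a corrected DataSource could look roughly like this (untested sketch of one possible fix: it creates the generator once and joins the background thread before handing data out):

import os
import pickle
import threading

class DataSource:
    def __init__(self, folder):
        self.folder = folder
        self._files = self._iter_files()   # one generator for the object's lifetime
        self._next = None
        self._thread = None
        self._prefetch()                   # start loading the first file immediately

    def _iter_files(self):
        for filename in sorted(os.listdir(self.folder)):
            yield os.path.join(self.folder, filename)

    def _load(self):
        try:
            path = next(self._files)
        except StopIteration:
            self._next = None              # no more files
            return
        with open(path, "rb") as f:
            self._next = pickle.load(f)

    def _prefetch(self):
        self._thread = threading.Thread(target=self._load, daemon=True)
        self._thread.start()

    def get(self):
        self._thread.join()                # wait until the prefetched file is ready
        data = self._next
        self._prefetch()                   # start loading the next file in the background
        return data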
I have a number of files, mapped to memory (as mmap objects). In the course of their processing, each file must be opened several times. It works fine if there is only one thread. However, when I try to run the task in parallel, a problem arises: different threads cannot access the same file simultaneously. The problem is illustrated by this sample:
import mmap, threading

class MmapReading(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        for i in range(10000):
            content = mmap_object.read().decode('utf-8')
            mmap_object.seek(0)
            if not content:
                print('Error while reading mmap object')

with open('my_dummy_file.txt', 'w') as f:
    f.write('Hello world')

with open('my_dummy_file.txt', 'r') as f:
    mmap_object = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)

threads = []
for i in range(64):
    threads.append(MmapReading())
    threads[i].daemon = True
    threads[i].start()

for thread in threading.enumerate():
    if thread != threading.current_thread():
        thread.join()

print('Mmap reading testing done!')
print('Mmap reading testing done!')
Whenever I run this script, I get around 20 error messages.
Is there a way to circumvent this problem, other than making 64 copies of each file (which would consume too much memory in my case)?
The seek(0) is not always performed before another thread jumps in and performs a read().
Say thread 1 performs a read, reading to the end of the file; seek(0) has not yet been executed.
Then thread 2 executes a read. The file pointer in the mmap is still at the end of the file, so read() returns ''.
The error detection code is triggered because content is ''.
Instead of using read(), you can use slicing to achieve the same result. Replace:
content = mmap_object.read().decode('utf-8')
mmap_object.seek(0)
with
content = mmap_object[:].decode('utf8')
content = mmap_object[:mmap_object.size()].decode('utf8') also works.
Locking is another way, but it's unnecessary in this case. If you want to try it, you can use a global threading.Lock object and pass that to MmapReading when instantiating. Store the lock object in an instance variable self.lock. Then call self.lock.acquire() before reading/seeking, and self.lock.release() afterwards. You'll experience a very noticeable performance penalty doing this.
from threading import Lock

class MmapReading(threading.Thread):
    def __init__(self, lock):
        self.lock = lock
        threading.Thread.__init__(self)

    def run(self):
        for i in range(10000):
            self.lock.acquire()
            mmap_object.seek(0)
            content = mmap_object.read().decode('utf-8')
            self.lock.release()
            if not content:
                print('Error while reading mmap object')

lock = Lock()
for i in range(64):
    threads.append(MmapReading(lock))
...
Note that I've changed the order of the read and the seek; it makes more sense to do the seek first, positioning the file pointer at the start of the file.
I fail to see where you need mmap to begin with; mmap is a technique to share data between processes. Why don't you just read the contents into memory (once!), e.g. as a list? Each thread would then access the list with its own set of iterators. Also, be aware of the GIL in Python, which prevents any speedup from multithreading. If you want that, use multiprocessing (and then an mmapped file makes sense, and is actually shared amongst the various processes).
The issue is that the single mmap_object is being shared among the threads so that thread A calls read and before it gets to the seek, thread B also calls read, and so gets no data.
What you really need is an ability to duplicate the python mmap object without duplicating the underlying mmap, but I see no way of doing that.
I think the only feasible solution short of rewriting the object implementation is to employ a lock (mutex, etc) per mmap object to prevent two threads from accessing the same object at the same time.
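A minimal sketch of that last suggestion, pairing each mmap with its own lock (the wrapper name is mine, not from the answer):

import mmap
import threading

class LockedMmap:
    """Wraps a single mmap object together with the lock that guards it."""
    def __init__(self, mm):
        self.mm = mm
        self.lock = threading.Lock()

    def read_all(self):
        with self.lock:        # only one thread may seek/read this mmap at a time
            self.mm.seek(0)
            return self.mm.read()

Every thread that touches the mmap has to go through the wrapper, of course, otherwise the lock does not protect anything.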