I have a directory of Pickled lists which I would like to load sequentially, use as part of an operation, and then discard. The files are around 0.75 - 2GB each when pickled and I can load a number in memory at any one time, although nowhere near all of them. Each pickled file represents one day of data.
Currently, the unpickling process consumes a substantial proportion of the runtime of the program. My proposed solution is to load the first file and, whilst the operation is running on this file, asynchronously load the next file in the list.
I have thought of two ways I could do this: 1) Threading and 2) Asyncio. I have tried both of these but neither has seemed to work. Below is my (attempted) implementation of a Threading-based solution.
import os
import threading
import pickle

class DataSource:

    def __init__(self, folder):
        self.folder = folder
        self.next_file = None

    def get(self):
        if self.next_file is None:
            self.load_file()
        data = self.next_file
        io_thread = threading.Thread(target=self.load_file, daemon=True)
        io_thread.start()
        return data

    def get_next_file(self):
        for filename in sorted(os.listdir(self.folder)):
            yield self.folder + filename

    def load_file(self):
        self.next_file = pickle.load(open(next(self.get_next_file()), "rb"))
The main program will call DataSource().get() to retrieve each file. The first time it is loaded, load_file() will load the file into next_file where it will be stored. Then, the thread io_thread should load each successive file into next_file to be returned via get() as needed.
The thread that is launched does appear to do some work (it consumes a vast amount of RAM, ~60GB) however it does not appear to update next_file.
Could someone suggest why this doesn't work? And, additionally, if there is a better way to achieve this result?
Thanks
DataSource().get() seems to be your first problem: that creates a new instance of the DataSource class every time, so you only ever load the first file, because you never call the same DataSource instance again and therefore never proceed to the next file. Maybe you mean to do something along the lines of:
datasource = DataSource()
while datasource.not_done():
    datasource.get()
It would be useful to share the full code, preferably on repl.it or somewhere else where it can be executed.
Also, if you want better performance, it might be worthwhile to look into the multiprocessing module, as Python blocks some operations with the global interpreter lock (GIL) so that only one thread runs at a time, even when you have multiple CPU cores. That might not be a problem in your case, though, since reading from disk is probably the bottleneck; I'd guess Python releases the lock while executing the underlying native code that reads from the filesystem.
I'm also curious about how you could use asyncio for pickles... I guess you'd read the pickle into memory from the file first, and then unpickle it when that's done, while doing other processing during the loading. That seems like it could work nicely.
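Something like this rough, untested sketch could do it (assuming Python 3.7+; the process() call and the list of paths are placeholders, not your code). Note that pickle.load holds the GIL while it deserializes, so the overlap you gain is mostly the disk read:

import asyncio
import pickle

def read_and_unpickle(path):
    # plain blocking work: read the bytes from disk and unpickle them
    with open(path, "rb") as f:
        return pickle.load(f)

async def main(paths):
    loop = asyncio.get_running_loop()
    # run_in_executor submits to a thread pool right away, so the next file is
    # already being read from disk while process() works on the current data
    next_future = loop.run_in_executor(None, read_and_unpickle, paths[0])
    for i in range(len(paths)):
        data = await next_future
        if i + 1 < len(paths):
            next_future = loop.run_in_executor(None, read_and_unpickle, paths[i + 1])
        process(data)  # placeholder for the actual work done on each day's data

# asyncio.run(main(sorted_list_of_pickle_paths))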
Finally, I'd add debug prints to see what's going on.
Update: The next problem seems to be that you are using the get_next_file generator wrongly. You create a new generator each time with self.get_next_file(), so you only ever load the first file. You should create the generator only once and then call next() on it. Maybe this helps to understand (it's also on repl.it):
def get_next_file():
    for filename in ['a', 'b', 'c']:
        yield filename

for n in get_next_file():
    print(n)

print("---")

print(next(get_next_file()))
print(next(get_next_file()))
print(next(get_next_file()))

print("---")

gen = get_next_file()
print(gen)
print(next(gen))
print(next(gen))
print(next(gen))
Output:
a
b
c
---
a
a
a
---
<generator object get_next_file at 0x7ff4757f6cf0>
a
b
c
https://repl.it/#ToniAlatalo/PythonYieldNext#main.py
Again, debug prints would help you see what's going on, e.g. which file you are loading and when.
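To make those two fixes concrete, here is a rough, untested sketch of how the class could be restructured so that one instance is reused and the generator is created only once (end-of-data handling is left out):

import os
import pickle
import threading

class DataSource:

    def __init__(self, folder):
        self.folder = folder
        self._paths = self._iter_paths()   # create the generator ONCE
        self._thread = None
        self._next_data = None

    def _iter_paths(self):
        for filename in sorted(os.listdir(self.folder)):
            yield os.path.join(self.folder, filename)

    def _load_next(self):
        with open(next(self._paths), "rb") as f:
            self._next_data = pickle.load(f)

    def get(self):
        if self._thread is None:       # first call: load synchronously
            self._load_next()
        else:                          # otherwise wait for the background load
            self._thread.join()
        data = self._next_data
        # prefetch the following file while the caller works on `data`
        self._thread = threading.Thread(target=self._load_next, daemon=True)
        self._thread.start()
        return data

The main program would then create a single instance, e.g. datasource = DataSource("path/to/folder/"), and call datasource.get() repeatedly on that same object.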
I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp

jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')

materials = [None] * len(jsons)

def asyncJSONs(file, index):
    try:
        with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = file.split('.')[0]
        materials[index] = properties
    except:
        print("Error parsing at {}".format(file))

process_list = []
i = 0
for file in tqdm(jsons):
    p = mp.Process(target=asyncJSONs, args=(file, i))
    p.start()
    process_list.append(p)
    i += 1

for process in process_list:
    process.join()
Everything in there relating to multiprocessing was cobbled together from a collection of Google searches and articles, so I wouldn't be surprised if it isn't remotely correct. For example, the 'i' variable is a dirty attempt to keep the information in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code nothing is stored in materials.
As you can read in other answers, processes don't share memory, so you can't set a value directly in materials. The function has to use return to send the result back to the main process, which has to wait for the result and collect it.
It can be simpler with Pool: you don't need to use a queue manually, it should return results in the same order as the data in all_jsons, and you can set how many processes run at the same time so it won't hog the CPU from other processes on the system.
But it can't use tqdm.
I couldn't test it, but it could be something like this:
import os
import json
from multiprocessing import Pool

# --- functions ---

def asyncJSONs(filename):
    try:
        fullpath = os.path.join(folder, filename)
        with open(fullpath) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = filename.split('.')[0]
        return properties
    except:
        print("Error parsing at {}".format(filename))

# --- main ---

# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'

if __name__ == '__main__':
    # code only for main process
    all_jsons = os.listdir(folder)

    with Pool(5) as p:
        materials = p.map(asyncJSONs, all_jsons)

    for item in materials:
        print(item)
BTW: other modules worth a look: concurrent.futures, joblib, ray.
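For example, with concurrent.futures the same map could look roughly like this (untested; process_dict is the function from the question):

import os
import json
from concurrent.futures import ProcessPoolExecutor

folder = '/content/drive/My Drive/mrp_workflow/JSONs'

def load_json(filename):
    with open(os.path.join(folder, filename)) as f:
        data = json.load(f)
    properties = process_dict(data, {})          # process_dict comes from the question
    properties['name'] = filename.split('.')[0]
    return properties

if __name__ == '__main__':
    all_jsons = os.listdir(folder)
    # executor.map keeps the results in the same order as all_jsons
    with ProcessPoolExecutor(max_workers=5) as executor:
        materials = list(executor.map(load_json, all_jsons))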
Going to mention a totally different way of solving this problem. Don't bother trying to append all the data to the same list. Extract the data you need and append it to some target file in ndjson/jsonlines format. That's just where, instead of the objects being part of a JSON array [{},{}...], you have separate objects on each line:
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
spawn N workers with a manifest of files to process and the output file to write to. You don't even need MP, you could use a tool like rush to parallelize.
worker parses data, generates the output dict
worker opens the output file with append flag. dump the data and flush immediately:
with open(out_file, 'a') as fp:
    print(json.dumps(data), file=fp, flush=True)
Flush ensures that, as long as your data is smaller than the buffer size on your kernel (usually several MB), your different processes won't stomp on each other and corrupt writes. If they do conflict, you may need to write to a separate output file for each worker and then join them all.
You can join the files and/or convert to regular JSON array if needed using jq. To be honest, just embrace jsonlines. It's a way better data format for long lists of objects, since you don't have to parse the whole thing in memory.
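Reading the combined output back later is just as simple, one record at a time, without loading the whole list into memory. A minimal sketch (the file name materials.ndjson is made up):

import json

with open('materials.ndjson') as fp:
    for line in fp:
        record = json.loads(line)   # one JSON object per line
        print(record.get('name'))   # stand-in for whatever you do with each record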
You need to understand how multiprocessing works. It starts a brand new process for EACH task, each with a brand new Python interpreter, which runs your script all over again. These processes do not share memory in any way. The other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can use a multiprocessing.Queue. Have the function stuff the results in the queue, while your main code waits for stuff to magically appear in the queue.
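A minimal sketch of that pattern (the file names and the fake parsing here are placeholders, not your code):

import multiprocessing as mp

def worker(filename, queue):
    result = {'name': filename}        # stand-in for the real parsing work
    queue.put((filename, result))      # send the result back to the main process

if __name__ == '__main__':
    files = ['a.json', 'b.json', 'c.json']
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(f, queue)) for f in files]
    for p in procs:
        p.start()
    # get() blocks until a result arrives; drain the queue before joining
    results = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()
    print(results)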
Also PLEASE read the instructions in the multiprocessing docs about main. Each new process will re-execute all the code in your main file. Thus, any one-time stuff absolutely must be contained in a
if __name__ == "__main__":
block. This is one case where the practice of putting your mainline code into a function called main() is a "best practice".
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.
I have 2 separate scripts working with the same variables.
To be more precise, one code edits the variables and the other one uses them (It would be nice if it could edit them too but not absolutely necessary.)
This is what I am currently doing:
When code 1 edits a variable it dumps it into a json file.
Code 2 repeatedly opens the json file to get the variables.
This method is really not elegant and the while loop is really slow.
How can I share variables across scripts?
My first script gets data from a midi controller and sends web-requests.
My second script is for LED strips (those run thanks to the same midi controller). Both scripts run in a "while true" loop.
I can't simply put them in the same script since every webrequest would slow the LEDs down. I am currently just sharing the variables via a json file.
If enough people ask for it I will post the whole code, but I have been told not to do this.
Considering the information you provided, meaning...
Both scripts run in a "while true" loop.
I can't simply put them in the same script since every webrequest would slow the LEDs down.
To me, you have 2 choices:
Use a client/server model. You have 2 machines: one acts as the server, the other as the client. The server has a script with an infinite loop that constantly updates the data, and an API that just reads and exposes the current state of your file/database to the client. The client would be on another machine and, as I understand it, would simply request the current data and process it.
Make a single multiprocessing script. Each script would run in a separate 'thread' and manage its own memory. As you also want to share variables between your two programs, you could pass an object as an argument that would be shared between both of them. See this resource to help you.
Note that there are more solutions to this. For instance, you're using a JSON file that you are constantly opening and closing (that is probably what takes the most time in your program). You could use a real database that can be opened once and queried many times, while still being updated.
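To illustrate the database idea, a small sqlite3 key/value table could stand in for the JSON file. This is just a sketch, and the file, table, and key names here are made up:

import sqlite3

# script 1 (the writer): open the database once, update it many times
con = sqlite3.connect('shared_state.db')
con.execute('CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value TEXT)')

def set_value(key, value):
    with con:  # commits the transaction automatically
        con.execute('INSERT OR REPLACE INTO state (key, value) VALUES (?, ?)',
                    (key, value))

# script 2 (the reader) would open its own connection and poll
con2 = sqlite3.connect('shared_state.db')

def get_value(key):
    row = con2.execute('SELECT value FROM state WHERE key = ?', (key,)).fetchone()
    return row[0] if row else None

set_value('fader_1', '87')
print(get_value('fader_1'))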
a Manager from multiprocessing lets you do this sort of thing pretty easily
first I simplify your "midi controller that sends web-requests" code down to something that just sleeps for random amounts of time and updates a variable in a managed dictionary:
from time import sleep
from random import random

def slow_fn(d):
    i = 0
    while True:
        sleep(random() ** 2)
        i += 1
        d['value'] = i
next we simplify the "LED strip" control down to something that just prints to the screen:
from time import perf_counter

def fast_fn(d):
    last = perf_counter()
    while True:
        sleep(0.05)
        value = d.get('value')
        now = perf_counter()
        print(f'fast {value} {(now - last) * 1000:.2f}ms')
        last = now
you can then run these functions in separate processes:
import multiprocessing as mp

with mp.Manager() as manager:
    d = manager.dict()
    procs = []
    for fn in [slow_fn, fast_fn]:
        p = mp.Process(target=fn, args=[d])
        procs.append(p)
        p.start()
    for p in procs:
        p.join()
the "fast" output happens regularly with no obvious visual pauses
Python 3
I would like to know what a really clean, pythonic concurrent data loader should look like. I need this approach for a project of mine that does heavy computations on data that is too big to fit entirely into memory. Hence, I implemented data loaders that run concurrently and store data in a queue, so that the main process can work while the next data is being loaded and prepared. Of course, the queue should block when it is empty (main process trying to consume more items -> queue should wait for new data) or full (worker process should wait until the main process consumes data out of the queue, to prevent out-of-memory errors).
I have written a class to fulfill this need using Python's multiprocessing module (multiprocessing.Queue and multiprocessing.Process). The crucial parts of the class are implemented as follows:
import multiprocessing as mp
import numpy as np
from itertools import cycle

class ConcurrentLoader:

    def __init__(self, path_to_data, queue_size, batch_size):
        self._batch_size = batch_size
        self._path = path_to_data
        filenames = ...  # filenames for path 'path_to_data', get loaded using glob
        self._files = cycle(filenames)
        self._q = mp.Queue(queue_size)
        ...
        self._worker = mp.Process(target=self._worker_func, daemon=True)
        self._worker.start()  # only started, never stopped

    def _worker_func(self):
        while True:
            buffer = list()
            for i in range(self._batch_size):
                f = next(self._files)
                ...  # load f and do some pre-processing with NumPy
                ...  # add it to buffer
            self._q.put(np.array(buffer).astype(np.float32))

    def get_batch_data(self):
        return self._q.get()
The class has some more methods, but they are all for "convenience functionality". For example, it counts in a dict how often each file was loaded, how often the whole data set was loaded and so on, but these are rather easy to implement in Python and do not waste much computation time (sets, dicts, ...).
The data loading itself, on the other hand, can take seconds due to I/O and pre-processing. That is why I want it to happen concurrently.
ConcurrentLoader should:
block main process: if get_batch_data is called, but queue is empty
block worker process: if queue is full, to prevent out-of-memory errors and prevent while True from wasting resources
be "transparent" to any class that uses ConcurrentLoader: they should just supply the path to the data and use get_batch_data without noticing that this actually works concurrently ("hassle free usage")
terminate its worker when main process dies to free resources again
Considering these goals (have I forgotten anything?), what should I do to enhance the current implementation? Is it thread/deadlock safe? Is there a more "pythonic" way of implementing it? Can I make it cleaner? Does it waste resources somehow?
Any class that uses ConcurrentLoader would roughly follow this setup:
class Foo:

    ...

    def do_something(self):
        ...
        data1 = ConcurrentLoader("path/to/data1", 64, 8)
        data2 = ConcurrentLoader("path/to/data2", 256, 16)
        ...
        sample1 = data1.get_batch_data()
        sample2 = data2.get_batch_data()
        ...  # heavy computations with data contained in 'sample1' & 'sample2' go *here*
Please either point out mistakes of any kind in order to improve my approach, or supply your own, cleaner, more pythonic approach.
Blocking when a multiprocessing.Queue is empty/full and get()/put() is called on it happens automatically. This behavior is transparent to calling functions.
Use self._worker.daemon = True before self._worker.start() so the worker(s) will automatically be killed when the main process exits.
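To see that blocking behavior in isolation, here is a tiny standalone demo (not your class) with a bounded queue and a daemon worker:

import time
import multiprocessing as mp

def worker(q):
    for i in range(10):
        q.put(i)                 # blocks whenever the queue already holds 2 items
        print('produced', i)

if __name__ == '__main__':
    q = mp.Queue(2)              # a small queue_size to make the blocking visible
    p = mp.Process(target=worker, args=(q,), daemon=True)
    p.start()
    for _ in range(10):
        time.sleep(0.5)          # simulate slow consumption in the main process
        print('consumed', q.get())  # blocks if the queue happens to be empty
    p.join()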
I'm trying to use the fbx Python module from Autodesk, but it seems I can't thread any operation. This seems to be due to the GIL not being released. Has anyone run into the same issue, or am I doing something wrong? When I say it doesn't work, I mean the code doesn't release the thread and I'm not able to do anything else while the fbx code is running.
There isn't much code to post; I just want to know whether this has happened to anyone else who has tried it.
Update:
Here is the example code; please note each fbx file is something like 2GB.
import os
import fbx
import threading

file_dir = r'../fbxfiles'

def parse_fbx(filepath):
    print '-' * (len(filepath) + 9)
    print 'parsing:', filepath
    manager = fbx.FbxManager.Create()
    importer = fbx.FbxImporter.Create(manager, '')
    status = importer.Initialize(filepath)
    if not status:
        raise IOError()

    scene = fbx.FbxScene.Create(manager, '')
    importer.Import(scene)

    # freeup memory
    rootNode = scene.GetRootNode()

    def traverse(node):
        print node.GetName()
        for i in range(0, node.GetChildCount()):
            child = node.GetChild(i)
            traverse(child)

    # RUN
    traverse(rootNode)

    importer.Destroy()
    manager.Destroy()

files = os.listdir(file_dir)
tt = []
for file_ in files:
    filepath = os.path.join(file_dir, file_)
    t = threading.Thread(target=parse_fbx, args=(filepath,))
    tt.append(t)
    t.start()
One problem I see is with your traverse() function. It's calling itself recursively potentially a huge number of times. Another is having all the threads printing stuff at the same time. Doing that properly requires coordinating access to the shared output device (i.e. the screen). A simple way to do that is by creating and using a global threading.Lock object.
First create a global Lock to prevent threads from printing at same time:
file_dir = '../fbxfiles'  # an "r" prefix needed only when path contains backslashes
print_lock = threading.Lock()  # add this here
Then make a non-recursive version of traverse() that uses it:
def traverse(rootNode):
    with print_lock:
        print rootNode.GetName()
    for i in range(rootNode.GetChildCount()):
        child = rootNode.GetChild(i)
        with print_lock:
            print child.GetName()
It's not clear to me exactly where the reading of each fbxfile takes place. If it all happens as a result of the importer.Import(scene) call, then that is the only time any other threads will be given a chance to run — unless some I/O is [also] done within the traverse() function.
Since printing is most definitely a form of output, thread switching will also be able to occur when it's done. However, if all the function did was perform computations of some kind, no multi-threading would take place within it during its execution.
Once you get the multithreaded reading working, you may encounter insufficient-memory issues if multiple 2GB fbx files are being read into memory simultaneously by the various different threads.
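If that becomes a problem, one way to cap memory usage would be a semaphore that limits how many files are parsed at once. A rough sketch built on the question's parse_fbx() and file_dir (the limit of 2 is arbitrary):

import os
import threading

# allow at most 2 files to be parsed (and held in memory) at the same time
parse_slots = threading.BoundedSemaphore(2)

def parse_fbx_limited(filepath):
    with parse_slots:            # blocks here until one of the 2 slots is free
        parse_fbx(filepath)      # the question's original function

# a thread is still created per file, but only 2 do real work at any moment
for file_ in os.listdir(file_dir):
    t = threading.Thread(target=parse_fbx_limited,
                         args=(os.path.join(file_dir, file_),))
    t.start()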
I have a number of files, mapped to memory (as mmap objects). In the course of their processing, each file must be opened several times. It works fine if there is only one thread. However, when I try to run the task in parallel, a problem arises: different threads cannot access the same file simultaneously. The problem is illustrated by this sample:
import mmap, threading

class MmapReading(threading.Thread):

    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        for i in range(10000):
            content = mmap_object.read().decode('utf-8')
            mmap_object.seek(0)
            if not content:
                print('Error while reading mmap object')

with open('my_dummy_file.txt', 'w') as f:
    f.write('Hello world')

with open('my_dummy_file.txt', 'r') as f:
    mmap_object = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)

threads = []
for i in range(64):
    threads.append(MmapReading())
    threads[i].daemon = True
    threads[i].start()

for thread in threading.enumerate():
    if thread != threading.current_thread():
        thread.join()

print('Mmap reading testing done!')
print('Mmap reading testing done!')
Whenever I run this script, I get around 20 error messages.
Is there a way to circumvent this problem, other than making 64 copies of each file (which would consume too much memory in my case)?
The seek(0) is not always performed before another thread jumps in and performs a read().
Say thread 1 performs a read, reading to the end of the file; seek(0) has not yet been executed.
Then thread 2 executes a read. The file pointer in the mmap is still at the end of the file. read() therefore returns ''.
The error detection code is triggered because content is ''.
Instead of using read(), you can use slicing to achieve the same result. Replace:
content = mmap_object.read().decode('utf-8')
mmap_object.seek(0)
with
content = mmap_object[:].decode('utf8')
content = mmap_object[:mmap_object.size()] also works.
Locking is another way, but it's unnecessary in this case. If you want to try it, you can use a global threading.Lock object and pass that to MmapReading when instantiating. Store the lock object in an instance variable self.lock. Then call self.lock.acquire() before reading/seeking, and self.lock.release() afterwards. You'll experience a very noticeable performance penalty doing this.
from threading import Lock

class MmapReading(threading.Thread):

    def __init__(self, lock):
        self.lock = lock
        threading.Thread.__init__(self)

    def run(self):
        for i in range(10000):
            self.lock.acquire()
            mmap_object.seek(0)
            content = mmap_object.read().decode('utf-8')
            self.lock.release()
            if not content:
                print('Error while reading mmap object')

lock = Lock()

for i in range(64):
    threads.append(MmapReading(lock))
.
.
.
Note that I've changed the order of the read and the seek; it makes more sense to do the seek first, positioning the file pointer at the start of the file.
I fail to see where you need mmap to begin with. mmap is a technique to share data between processes. Why don't you just read the contents into memory (once!), e.g. as a list? Each thread will then access the list with its own set of iterators. Also, be aware of the GIL in Python, which prevents any speedup from multithreading. If you want that, use multiprocessing (and then a mmapped file makes sense, since it is actually shared among the various processes).
The issue is that the single mmap_object is being shared among the threads so that thread A calls read and before it gets to the seek, thread B also calls read, and so gets no data.
What you really need is an ability to duplicate the python mmap object without duplicating the underlying mmap, but I see no way of doing that.
I think the only feasible solution short of rewriting the object implementation is to employ a lock (mutex, etc) per mmap object to prevent two threads from accessing the same object at the same time.