Multiple access to mmap objects in python - python

I have a number of files mapped to memory (as mmap objects). In the course of processing, each file must be read several times. This works fine with a single thread. However, when I try to run the task in parallel, a problem arises: different threads cannot access the same file simultaneously. The problem is illustrated by this sample:
import mmap, threading
class MmapReading(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
    def run(self):
        for i in range(10000):
            content = mmap_object.read().decode('utf-8')
            mmap_object.seek(0)
            if not content:
                print('Error while reading mmap object')
with open('my_dummy_file.txt', 'w') as f:
    f.write('Hello world')
with open('my_dummy_file.txt', 'r') as f:
    mmap_object = mmap.mmap(f.fileno(), 0, prot = mmap.PROT_READ)
threads = []
for i in range(64):
    threads.append(MmapReading())
    threads[i].daemon = True
    threads[i].start()
for thread in threading.enumerate():
    if thread != threading.current_thread():
        thread.join()
print('Mmap reading testing done!')
Whenever I run this script, I get around 20 error messages.
Is there a way to circumvent this problem, other than making 64 copies of each file (which would consume too much memory in my case)?

The seek(0) is not always performed before another thread jumps in and performs a read().
Say thread 1 performs a read, reading to end of file; seek(0) has
not yet been executed.
Then thread 2 executes a read. The file pointer in the mmap is still
at the end of the file. read() therefore returns ''.
The error detection code is triggered because content is ''.
Instead of using read(), you can use slicing to achieve the same result. Replace:
content = mmap_object.read().decode('utf-8')
mmap_object.seek(0)
with
content = mmap_object[:].decode('utf8')
content = mmap_object[:mmap_object.size()] also works.
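As a rough illustration of the slicing approach, here is a minimal sketch in which several threads read one shared mmap without ever calling read() or seek(). The scratch file, the names shared_map, read_many and empty_counts, and the thread and loop counts are all assumptions made for this example; access=mmap.ACCESS_READ is used in place of prot=mmap.PROT_READ only because it also works on Windows.

```python
import mmap
import os
import tempfile
import threading

# Hypothetical scratch file standing in for 'my_dummy_file.txt'.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'Hello world')

with open(path, 'rb') as f:
    # Read-only mapping of the whole file; it stays valid after f is closed.
    shared_map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

N_THREADS = 8
empty_counts = [0] * N_THREADS  # each thread writes only to its own slot

def read_many(slot, times):
    for _ in range(times):
        # Slicing copies the bytes out without using the shared file position.
        content = shared_map[:].decode('utf-8')
        if not content:
            empty_counts[slot] += 1

workers = [threading.Thread(target=read_many, args=(i, 1000))
           for i in range(N_THREADS)]
for w in workers:
    w.start()
for w in workers:
    w.join()

shared_map.close()
os.remove(path)
print('empty reads:', sum(empty_counts))
```

Because a slice never moves the position that read() and seek() share, there is nothing for the threads to race on in this sketch.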
Locking is another way, but it's unnecessary in this case. If you want to try it, you can use a global threading.Lock object and pass that to MmapReading when instantiating. Store the lock object in an instance variable self.lock. Then call self.lock.acquire() before reading/seeking, and self.lock.release() afterwards. You'll experience a very noticeable performance penalty doing this.
from threading import Lock
class MmapReading(threading.Thread):
    def __init__(self, lock):
        self.lock = lock
        threading.Thread.__init__(self)
    def run(self):
        for i in range(10000):
            self.lock.acquire()
            mmap_object.seek(0)
            content = mmap_object.read().decode('utf-8')
            self.lock.release()
            if not content:
                print('Error while reading mmap object')
lock = Lock()
for i in range(64):
    threads.append(MmapReading(lock))
    .
    .
    .
Note that I've changed the order of the read and the seek; it makes more sense to do the seek first, positioning the file pointer at the start of the file.

I fail to see why you need mmap to begin with. mmap is mainly useful as a technique for sharing data between processes. Why don't you just read the contents into memory (once!), e.g. as a list? Each thread would then access the list with its own set of iterators. Also, be aware of the GIL in Python, which keeps CPU-bound threads from running in parallel, so multithreading gives little speedup for that kind of work. If you want that, use multiprocessing (then a mmapped file makes sense, since it is actually shared among the various processes).

The issue is that the single mmap_object is being shared among the threads so that thread A calls read and before it gets to the seek, thread B also calls read, and so gets no data.
What you really need is an ability to duplicate the python mmap object without duplicating the underlying mmap, but I see no way of doing that.
I think the only feasible solution short of rewriting the object implementation is to employ a lock (mutex, etc) per mmap object to prevent two threads from accessing the same object at the same time.
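A minimal sketch of the lock-per-mmap idea follows. The LockedMmap class, its read_all method and the scratch file are hypothetical names invented for this example, not part of the mmap module.

```python
import mmap
import os
import tempfile
import threading

class LockedMmap:
    """Hypothetical wrapper that pairs one mmap object with one lock."""

    def __init__(self, path):
        with open(path, 'rb') as f:
            self._map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        self._lock = threading.Lock()

    def read_all(self):
        # seek() and read() run as one unit, so no other thread can see
        # the position left at end-of-file by a concurrent read().
        with self._lock:
            self._map.seek(0)
            return self._map.read()

    def close(self):
        self._map.close()

# Scratch file used only for the demonstration.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'Hello world')

locked = LockedMmap(path)
results = [None] * 8  # one slot per thread

def worker(slot):
    ok = True
    for _ in range(500):
        ok = ok and locked.read_all() == b'Hello world'
    results[slot] = ok

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

locked.close()
os.remove(path)
print(all(results))
```

Each mapped file gets its own lock here, so threads working on different files do not wait on each other.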

Related

Sequentially unpickle large files asynchronously

I have a directory of Pickled lists which I would like to load sequentially, use as part of an operation, and then discard. The files are around 0.75 - 2GB each when pickled and I can load a number in memory at any one time, although nowhere near all of them. Each pickled file represents one day of data.
Currently, the unpickling process consumes a substantial proportion of the runtime of the program. My proposed solution is to load the first file and, whilst the operation is running on this file, asynchronously load the next file in the list.
I have thought of two ways I could do this: 1) Threading and 2) Asyncio. I have tried both of these but neither has seemed to work. Below is my (attempted) implementation of a Threading-based solution.
import os
import threading
import pickle
class DataSource:
    def __init__(self, folder):
        self.folder = folder
        self.next_file = None
    def get(self):
        if self.next_file is None:
            self.load_file()
        data = self.next_file
        io_thread = threading.Thread(target=self.load_file, daemon=True)
        io_thread.start()
        return data
    def get_next_file(self):
        for filename in sorted(os.listdir(self.folder)):
            yield self.folder + filename
    def load_file(self):
        self.next_file = pickle.load(open(next(self.get_next_file()), "rb"))
The main program will call DataSource().get() to retrieve each file. The first time it is loaded, load_file() will load the file into next_file where it will be stored. Then, the thread io_thread should load each successive file into next_file to be returned via get() as needed.
The thread that is launched does appear to do some work (it consumes a vast amount of RAM, ~60GB) however it does not appear to update next_file.
Could someone suggest why this doesn't work? And, additionally, if there is a better way to achieve this result?
Thanks
DataSource().get() seems to be your first problem: that means you create a new instance of the DataSource class always, and only ever get to load the first file, because you never call the same DataSource object instance again so that you'd proceed to the next file. Maybe you mean to do along the lines of:
datasource = DataSource()
while datasource.not_done():
    datasource.get()
It would be useful to share the full code, preferably on repl.it or somewhere else where it can be executed.
Also, if you want better performance, it might be worthwhile to look into the multiprocessing module, as Python blocks some operations with the global interpreter lock (GIL) so that only one thread runs at a time, even when you have multiple CPU cores. That might not be a problem in your case, though, as reading from disk is probably the bottleneck; I'd guess Python releases the lock while executing the underlying native code that reads from the filesystem.
I'm also curious about how you could use asyncio for pickles .. I guess you read the pickle to mem from a file first, and then unpickle when it's done, while doing other processing during the loading. That seems like it could work nicely.
Finally, I'd add debug prints to see what's going on.
Update: Next problem seems to be that you are using the get_next_file generator wrongly. There you create a new generator each time with self.get_next_file(), so you only ever load the first file. You should only create the generator once and then call next() on it. Maybe this helps to understand, is also on replit:
def get_next_file():
    for filename in ['a', 'b', 'c']:
        yield filename
for n in get_next_file():
    print(n)
print("---")
print(next(get_next_file()))
print(next(get_next_file()))
print(next(get_next_file()))
print("---")
gen = get_next_file()
print(gen)
print(next(gen))
print(next(gen))
print(next(gen))
Output:
a
b
c
---
a
a
a
---
<generator object get_next_file at 0x7ff4757f6cf0>
a
b
c
https://repl.it/#ToniAlatalo/PythonYieldNext#main.py
Again, debug prints would help you see what's going on, what file you are loading when etc.
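Putting the two points together (reuse one object, create the generator only once), a possible rework might look like the sketch below. PrefetchingSource and its method names are hypothetical, the demonstration writes small temporary pickles instead of the real data, and None is used to signal "no more files", which assumes no pickle itself contains None.

```python
import os
import pickle
import tempfile
import threading

class PrefetchingSource:
    """Hypothetical rework of DataSource: one generator, one file loaded ahead."""

    def __init__(self, folder):
        # The generator is created once, so next() advances through the files.
        self._paths = (os.path.join(folder, name)
                       for name in sorted(os.listdir(folder)))
        self._next_data = None
        self._thread = None
        self._start_prefetch()

    def _load_next(self):
        path = next(self._paths, None)
        if path is None:
            self._next_data = None  # folder exhausted
            return
        with open(path, 'rb') as f:
            self._next_data = pickle.load(f)

    def _start_prefetch(self):
        self._thread = threading.Thread(target=self._load_next, daemon=True)
        self._thread.start()

    def get(self):
        """Return the next unpickled object, or None when no files remain."""
        self._thread.join()         # wait for the pending load to finish
        data = self._next_data
        if data is not None:
            self._start_prefetch()  # load the following file in the background
        return data

# Demonstration with three tiny pickles in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    for day in range(3):
        with open(os.path.join(folder, 'day%d.pkl' % day), 'wb') as f:
            pickle.dump([day] * 3, f)
    source = PrefetchingSource(folder)  # created once and reused
    loaded = []
    while True:
        item = source.get()
        if item is None:
            break
        loaded.append(item)

print(loaded)
```

Joining the previous thread before reading the result means at most one load runs at a time, so the generator is never advanced by two threads at once.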

Clean, pythonic way for concurrent data loaders?

Python 3
I would like to know what a really clean, pythonic concurrent data loader should look like. I need this approach for a project of mine that does heavy computations on data that is too big to entirely fit into memory. Hence, I implemented data loaders that should run concurrently and store data in a queue, so that the main process can work while (in the mean time) the next data is being loaded & prepared. Of course, the queue should block when it is empty (main process trying to consume more items -> queue should wait for new data) or full (worker process should wait until main process consumes data out of the queue to prevent out-of-memory errors).
I have written a class to fulfill this need using Python's multiprocessing module (multiprocessing.Queue and multiprocessing.Process). The crucial parts of the class are implemented as follows:
import multiprocessing as mp
from itertools import cycle
import numpy as np
class ConcurrentLoader:
    def __init__(self, path_to_data, queue_size, batch_size):
        self._batch_size = batch_size
        self._path = path_to_data
        filenames = ... # filenames for path 'path_to_data',
                        # get loaded using glob
        self._files = cycle(filenames)
        self._q = mp.Queue(queue_size)
        ...
        self._worker = mp.Process(target=self._worker_func, daemon=True)
        self._worker.start() # only started, never stopped
    def _worker_func(self):
        while True:
            buffer = list()
            for i in range(self._batch_size):
                f = next(self._files)
                ... # load f and do some pre-processing with NumPy
                ... # add it to buffer
            self._q.put(np.array(buffer).astype(np.float32))
    def get_batch_data(self):
        return self._q.get()
The class has some more methods, but they are all for "convenience functionality". For example, it counts in a dict how often each file was loaded, how often the whole data set was loaded and so on, but these are rather easy to implement in Python and do not waste much computation time (sets, dicts, ...).
The data part itself on the other hand, due to I/O and pre-processing, can even take seconds. That is the reason why I want this to happen concurrently.
ConcurrentLoader should:
block main process: if get_batch_data is called, but queue is empty
block worker process: if queue is full, to prevent out-of-memory errors and prevent while True from wasting resources
be "transparent" to any class that uses ConcurrentLoader: they should just supply the path to the data and use get_batch_data without noticing that this actually works concurrently ("hassle free usage")
terminate its worker when main process dies to free resources again
Considering these goals (have I forgotten anything?), what should I do to enhance the current implementation? Is it thread-safe and deadlock-safe? Is there a more "pythonic" way of implementing it? Can I make it cleaner? Does it waste resources somehow?
Any class that uses ConcurrentLoader would roughly follow this setup:
class Foo:
    ...
    def do_something(self):
        ...
        data1 = ConcurrentLoader("path/to/data1", 64, 8)
        data2 = ConcurrentLoader("path/to/data2", 256, 16)
        ...
        sample1 = data1.get_batch_data()
        sample2 = data2.get_batch_data()
        ... # heavy computations with data contained in 'sample1' & 'sample2'
            # go *here*
Please either point out mistakes of any kind in order to improve my approach or supply an own, cleaner, more pythonic approach.
Blocking when a multiprocessing.Queue is empty/full and
get()/put() is called on it happens automatically.
This behavior is transparent to calling functions.
Use self._worker.daemon = True before self._worker.start() so the worker(s) will automatically be killed when main process exits
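The automatic blocking on a full or empty queue can be seen in a small sketch. To keep it runnable anywhere without a __main__ guard, this example deliberately swaps in threading.Thread and queue.Queue; multiprocessing.Queue exposes the same blocking put() and get() calls. The producer function, the queue size of 2 and the None sentinel are assumptions for illustration.

```python
import queue
import threading

def producer(q, n_items):
    for i in range(n_items):
        q.put(i)   # blocks while the queue already holds maxsize items
    q.put(None)    # sentinel: tells the consumer there is no more data

q = queue.Queue(maxsize=2)  # at most two items are buffered at any time
worker = threading.Thread(target=producer, args=(q, 10), daemon=True)
worker.start()

consumed = []
while True:
    item = q.get()  # blocks while the queue is empty
    if item is None:
        break
    consumed.append(item)

worker.join()
print(consumed)
```

The bounded maxsize is what keeps the producer from running ahead and filling memory; the consumer never has to poll.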

Is this Python code a safe way to use multi-threading

An application I use for graphics has an embedded Python interpreter - It works exactly the same as any other Python interpreter except there are a few special objects.
Basically I am trying to use Python to download a bunch of images and make other Network and disk I/O. If I do this without multithreading, my application will freeze (i.e. videos quit playing) until the downloads are finished.
To get around this I am trying to use multi-threading. However, I can not touch any of the main process.
I have written this code. The only parts unique to the program are commented. me.store / me.fetch is basically a way of getting a global variable. op('files') refers to a global table.
These are two things, "in the main process" that can only be touched in a thread safe way. I am not sure if my code does this.
I would appreciate any input as to why (or why not) this code is thread-safe, and how I can access the global variables in a thread-safe way.
One thing I am worried about is how the counter is fetched multiple times by many threads. Since it is only updated after the file is written, could this cause a race-condition where the different threads access the counter with the same value (and then don't store the incremented value correctly). Or, what happens to the counter if the disk write fails.
from urllib import request
import threading, queue, os
url = 'http://users.dialogfeed.com/en/snippet/dialogfeed-social-wall-twitter-instagram.json?api_key=ac77f8f99310758c70ee9f7a89529023'
imgs = [
    'http://search.it.online.fr/jpgs/placeholder-hollywood.jpg.jpg',
    'http://www.lpkfusa.com/Images/placeholder.jpg',
    'http://bi1x.caltech.edu/2015/_images/embryogenesis_placeholder.jpg'
]
def get_pic(url):
    # Fetch image data
    data = request.urlopen(url).read()
    # This is the part I am concerned about, what if multiple threads fetch the counter before it is updated below
    # What happens if the file write fails?
    counter = me.fetch('count', 0)
    # Download the file
    with open(str(counter) + '.jpg', 'wb') as outfile:
        outfile.write(data)
    file_name = 'file_' + str(counter)
    path = os.getcwd() + '\\' + str(counter) + '.jpg'
    me.store('count', counter + 1)
    return file_name, path
def get_url(q, results):
    url = q.get_nowait()
    file_name, path = get_pic(url)
    results.append([file_name, path])
    q.task_done()
def fetch():
    # Clear the table
    op('files').clear()
    results = []
    url_q = queue.Queue()
    # Simulate getting a JSON feed
    print(request.urlopen(url).read().decode('utf-8'))
    for img in imgs:
        # Add url to queue and start a thread
        url_q.put(img)
        t = threading.Thread(target=get_url, args=(url_q, results,))
        t.start()
    # Wait for threads to finish before updating table
    url_q.join()
    for cell in results:
        op('files').appendRow(cell)
    return
# Start a thread so that the first http get doesn't block
thread = threading.Thread(target=fetch)
thread.start()
Your code doesn't appear to be safe at all. Key points:
Appending to results is unsafe -- two threads might try to append to the list at the same time.
Accessing and setting counter is unsafe -- a thread may fetch counter before another thread has set the new counter value.
Passing a queue of urls is redundant -- just pass a new url to each job.
Another way (concurrent.futures)
Since you are using python 3, why not make use of the concurrent.futures module, which makes your task much easier to manage. Below I've written out your code in a way which does not require explicit synchronisation -- all the work is handled by the futures module.
from urllib import request
import os
import threading
from concurrent.futures import ThreadPoolExecutor
from itertools import count
url = 'http://users.dialogfeed.com/en/snippet/dialogfeed-social-wall-twitter-instagram.json?api_key=ac77f8f99310758c70ee9f7a89529023'
imgs = [
    'http://search.it.online.fr/jpgs/placeholder-hollywood.jpg.jpg',
    'http://www.lpkfusa.com/Images/placeholder.jpg',
    'http://bi1x.caltech.edu/2015/_images/embryogenesis_placeholder.jpg'
]
def get_pic(url, counter):
    # Fetch image data
    data = request.urlopen(url).read()
    # Download the file
    with open(str(counter) + '.jpg', 'wb') as outfile:
        outfile.write(data)
    file_name = 'file_' + str(counter)
    path = os.getcwd() + '\\' + str(counter) + '.jpg'
    return file_name, path
def fetch():
    # Clear the table
    op('files').clear()
    with ThreadPoolExecutor(max_workers=2) as executor:
        count_start = me.fetch('count', 0)
        # reserve these numbers for our tasks
        me.store('count', count_start + len(imgs))
        # separate fetching and storing is usually not thread safe
        # however, if only one thread modifies count (the one running fetch) then
        # this will be safe (same goes for the files variable)
        for cell in executor.map(get_pic, imgs, count(count_start)):
            op('files').appendRow(cell)
# Start a thread so that the first http get doesn't block
thread = threading.Thread(target=fetch)
thread.start()
If multiple threads modify count then you should use a lock when modifying count.
eg.
lock = threading.Lock()
def fetch():
    ...
    with lock:
        # Do not release the lock between accessing and modifying count.
        # Other threads wanting to modify count, must use the same lock object (not
        # another instance of Lock).
        count_start = me.fetch('count', 0)
        me.store('count', count_start + len(imgs))
    # use count_start here
The only problem with this is that if one job fails for some reason, you will get a missing file number. Any raised exception will also interrupt the executor doing the mapping by re-raising the exception there, so you can then do something if needed.
You could avoid using a counter by using the tempfile module to find somewhere to temporarily store a file before moving the file somewhere permanent.
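One way the tempfile suggestion could look is sketched below. The save_bytes helper, the '.jpg' suffix and the fake payload are assumptions for illustration; tempfile.mkstemp is a real standard-library call that creates a uniquely named file, so no shared counter is needed to avoid name clashes.

```python
import os
import tempfile

def save_bytes(data, directory):
    """Write data to a uniquely named file in directory and return its path."""
    # mkstemp creates the file and guarantees the name is not already in use.
    fd, tmp_path = tempfile.mkstemp(suffix='.jpg', dir=directory)
    with os.fdopen(fd, 'wb') as outfile:
        outfile.write(data)
    return tmp_path

# Demonstration inside a throwaway directory with placeholder bytes.
with tempfile.TemporaryDirectory() as directory:
    paths = [save_bytes(b'fake image bytes', directory) for _ in range(3)]
    all_unique = len(set(paths)) == len(paths)
    sizes = [os.path.getsize(p) for p in paths]

print(all_unique, sizes)
```

A file written this way could later be moved to its permanent location, for example with os.replace, once the download is known to have succeeded.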
Remember to look at multiprocessing and threading if you are new to python multi-threading stuff.
Your code seems OK, though the code style is not very easy to read. You need to run it to see if it works as you expect.
with will make sure your lock is released. The acquire() method will be called when the block is entered, and release() will be called when the block is exited.
If you add more threads, make sure they are not using the same address from the queue and that there is no race condition (it seems Queue.get() takes care of this, but you need to run it to verify). Remember, all threads share the same process, so almost everything is shared. You don't want two threads handling the same address.
The Lock doesn't do anything at all. You only have one thread that ever calls download_job - that's the one you assigned to my_thread. The other one, the main thread, calls offToOn and is finished as soon as it reaches the end of that function. So there is no second thread that ever tries to acquire the lock, and hence no second thread ever gets blocked. The table you mention is, apparently, in a file that you explicitly open and close. If the operating system protects this file against simultaneous access from different programs, you can get away with this; otherwise it is definitely unsafe because you haven't accomplished any thread synchronization.
Proper synchronization between threads requires that different threads have access to the SAME lock; i.e., one lock is accessed by multiple threads. Also note that "thread" is not a synonym for "process." Python supports both. If you're really supposed to avoid accessing the main process, you have to use the multiprocessing module to launch and manage a second process.
And this code will never exit, since there is always a thread running in an infinite loop (in threader).
Accessing a resource in a thread-safe manner requires something like this:
from threading import Lock
a_lock = Lock()
def use_resource():
    with a_lock:
        # do something
        pass
The lock is created once, outside the function that uses it. Every access to the resource in the whole application, from whatever thread, must acquire the same lock, either by calling use_resource or some equivalent.
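As a small sketch of that rule, the example below has several threads reserve numbers from a shared counter through a single lock. The names counter_lock, reserve_number and reserved, and the thread and iteration counts, are made up for this illustration.

```python
import threading

counter_lock = threading.Lock()  # created once, shared by every thread
counter = 0

def reserve_number():
    """Read and advance the shared counter as one step under the lock."""
    global counter
    with counter_lock:
        value = counter
        counter = value + 1
    return value

N_THREADS = 8
reserved = [[] for _ in range(N_THREADS)]  # one private list per thread

def worker(slot):
    for _ in range(1000):
        reserved[slot].append(reserve_number())

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

all_numbers = sorted(n for chunk in reserved for n in chunk)
print(all_numbers == list(range(N_THREADS * 1000)))
```

Because the read and the update happen inside the same with block, no two threads can be handed the same number.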

Proper locking in method calls

I'm talking to a measurement device. I basically send commands and receive answers. But I'm providing a method ask that sends a command and reads back the answer. If I lock this method, I get a deadlock because the called methods read and write lock as well. If I don't lock, another thread could steal the answer or write before I'm reading. How would you implement this?
import threading
class Device(object):
    lock = threading.Lock()
    def ask(self, value):
        # can't use lock here would block
        self.write(value) # another thread could start reading the answer
        return self.read()
    def read(self):
        with self.lock:
            # read values from device
            pass
    def write(self, value):
        with self.lock:
            # send command to device
            pass
Use threading.RLock() to avoid contention within a single thread:
A reentrant lock is a synchronization primitive that may be acquired
multiple times by the same thread. Internally, it uses the concepts of
“owning thread” and “recursion level” in addition to the
locked/unlocked state used by primitive locks. In the locked state,
some thread owns the lock; in the unlocked state, no thread owns it.
You can use a threading.RLock() object to get a reentrant lock. But it is better to rewrite the code so that it does not need an RLock. For example, you could remove the locks from write() and read() and rewrite ask() along these lines:
with self.lock:
    self.write(value)
    r = self.read()
return r
RLock() is slower in old versions of Python because its implementation is more complicated.
Also note that in the code you wrote, there is one lock shared by all instances. In some cases that is appropriate (for example, if you have only one device and many instances), but in general it is not. If you want a different lock for each instance, initialize it in the __init__() method.
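A sketch combining both suggestions (a per-instance lock, and ask() holding it across the write and the read) might look like this. The in-memory _pending attribute is a hypothetical stand-in for the real device, and the _write_unlocked and _read_unlocked helper names are invented for the example; this variant keeps locked write() and read() methods for callers that use them directly.

```python
import threading

class Device:
    """Sketch using a fake in-memory device that echoes the last command."""

    def __init__(self):
        self._lock = threading.Lock()  # one lock per instance
        self._pending = None           # stands in for the device's answer buffer

    def _write_unlocked(self, value):
        self._pending = value

    def _read_unlocked(self):
        answer, self._pending = self._pending, None
        return answer

    def write(self, value):
        with self._lock:
            self._write_unlocked(value)

    def read(self):
        with self._lock:
            return self._read_unlocked()

    def ask(self, value):
        # Hold the lock across both steps so no other thread can slip in
        # between sending the command and reading its answer.
        with self._lock:
            self._write_unlocked(value)
            return self._read_unlocked()

device = Device()
mismatches = [0] * 8  # one slot per thread

def worker(slot):
    for i in range(500):
        command = (slot, i)
        if device.ask(command) != command:
            mismatches[slot] += 1

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(mismatches))
```

Since ask() calls only the unlocked helpers, a plain Lock is enough and no thread ever tries to acquire a lock it already holds.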

In python when threads die?

I have a service that spawns threads,
and I may have a resource leak in the code I am using.
I have similar code in Python that uses threads:
import threading
from datetime import datetime
class Worker(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
    def run(self):
        # now i am using django orm to make a query
        dataList = Mydata.objects.filter(date__isnull=True)[:chunkSize]
        print('%s - DB worker finished reading %s entrys' % (datetime.now(), len(dataList)))
while True:
    myWorker = Worker()
    myWorker.start()
    while myWorker.is_alive(): # wait for worker to finish
        do_other_work()
Is it OK?
Will the threads die when they finish executing the run method?
Do I cause a resource leak?
Looking at your previous question (that you linked in a comment), the problem is that you're running out of file descriptors.
From the official doc:
File descriptors are small integers corresponding to a file that has been opened by the current process. For example, standard input is usually file descriptor 0, standard output is 1, and standard error is 2. Further files opened by a process will then be assigned 3, 4, 5, and so forth. The name “file descriptor” is slightly deceptive; on Unix platforms, sockets and pipes are also referenced by file descriptors.
Now I'm guessing, but it could be that you're doing something like:
class Worker(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
    def run(self):
        my_file = open('example.txt')
        # operations
        my_file.close() # without this line!
You need to close your files!
You're probably starting many threads and each one of them is opening but not closing a file, this way after some time you don't have more "small integers" to assign for opening a new file.
Also note that anything could happen in the # operations part; if an exception is thrown, the file will not be closed unless the code is wrapped in a try/finally statement.
There's a better way for dealing with files: the with statement:
with open('example.txt') as my_file:
    # bunch of operations with the file
# other operations for which you don't need the file
Once a thread object is created, its activity must be started by calling the thread’s start() method. This invokes the run() method in a separate thread of control.
Once the thread’s activity is started, the thread is considered ‘alive’. It stops being alive when its run() method terminates – either normally, or by raising an unhandled exception. The is_alive() method tests whether the thread is alive.
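The lifecycle described in that quote can be observed with a short sketch. The release event is only a device to hold the thread inside run() long enough to check it; the names release and states are assumptions for this example.

```python
import threading

release = threading.Event()

# The thread's run() simply waits on the event, so it stays alive until
# the main thread sets it.
t = threading.Thread(target=release.wait)

states = [t.is_alive()]      # not started yet
t.start()
states.append(t.is_alive())  # run() is blocked on the event
release.set()                # let run() return
t.join()
states.append(t.is_alive())  # run() has terminated

print(states)
```

Once run() returns, the thread is no longer alive; resources that run() itself opened, such as files, still have to be closed by the code that opened them.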
From python site
