I have a database with a lot of records and an application built on the Django framework. Right now I run a query against that database and collect the results in a list. Each record has a field called priority, and a for loop then processes the records one by one according to that priority. But I have a problem.
My database is very dynamic: while I'm processing the current list, a new record with a higher priority may arrive. I should process it first, but with the current architecture I can't; I have to wait until the current list is fully processed. How can I achieve my goal?
I have an alternative in mind, but I'm not sure it is the best way: inside a while loop, I could run a query that fetches only the single record with the highest priority.
What's your opinion of this alternative solution? Is there a better way?
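For reference, a minimal sketch of that alternative with the Django ORM, assuming a hypothetical Record model with priority and processed fields and a handle() function (all names invented for illustration):

from myapp.models import Record  # hypothetical app/model

def process_by_priority():
    while True:
        # Re-query on every iteration so a newly inserted high-priority
        # record is picked up before older, lower-priority ones.
        record = (Record.objects
                  .filter(processed=False)
                  .order_by('-priority')
                  .first())
        if record is None:
            break
        handle(record)                               # your per-record processing
        record.processed = True
        record.save(update_fields=['processed'])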
You can use threading to run threads (or sub-processes) that handle your high-priority data through a Queue, as WhozCraig said.
Here's an example of how it could look. If you want to use multiple threads and functions other than run(), you will have to change the thread constructor call from thread1_high_priority = High_priority_Thread(1, 10, queue), where the parameters are consumed in run(), to
thread1_high_priority = High_priority_Thread(target=function_name, name=name), and change __init__ accordingly: def __init__(self, target, name):.
import Queue
import threading

queue = Queue.Queue()

class High_priority_Thread(threading.Thread):
    """A thread that counts from `begin` to `end` and puts each value on the queue."""
    def __init__(self, begin, end, queue):
        # don't name these attributes `start`/`stop`: that would shadow Thread.start()
        self.begin = begin
        self.end = end
        self.queue = queue
        threading.Thread.__init__(self)

    # run() counts the higher-priority data; extend it to also count lower
    # priority, or create another function for low-priority data and run it
    # in a separate thread from thread 1.
    def run(self):
        while True:
            if self.begin != self.end:
                self.begin += 1
                self.queue.put(self.begin)
            else:
                break

thread1_high_priority = High_priority_Thread(1, 10, queue)  # start at 1 and stop at 10
thread1_high_priority.start()  # start thread1
thread2_lower_priority = High_priority_Thread(1, 3, queue)  # start at 1 and stop at 3
thread2_lower_priority.start()  # start thread2

thread1_high_priority.join()  # wait until both threads have filled the queue
thread2_lower_priority.join()

while not queue.empty():  # drain whatever the threads produced
    out = queue.get()
    print out
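For completeness, a small sketch of the target/name variant described above (Python 3 assumed; count_high_priority and the other names are invented for illustration):

import queue
import threading

work_queue = queue.Queue()

class High_priority_Thread(threading.Thread):
    """Variant that runs an arbitrary callable instead of a hard-coded run()."""
    def __init__(self, target, name):
        threading.Thread.__init__(self, name=name)
        self.target = target

    def run(self):
        self.target()

def count_high_priority():
    # hypothetical work function: push the numbers 1..10 onto the shared queue
    for value in range(1, 11):
        work_queue.put(value)

worker = High_priority_Thread(target=count_high_priority, name='high-priority')
worker.start()
worker.join()

while not work_queue.empty():
    print(work_queue.get())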
My app gets new tasks endlessly. I have created a class that will handle all these incoming tasks:
class Executor:
    pool: ThreadPool

    def __init__(self, pool_size: int):
        self.pool = ThreadPool(pool_size)

    def start(self):
        while True:
            self.refresh_args()
            self.pool.map(self.handler, self.args)
            self.pool.join()
This code is wrong, of course. The problem is that I don't want to wait for all tasks in the pool to finish: the Executor must add a new task to the pool as soon as at least one thread finishes its work. It should be an endless loop, and all threads in the pool should always be busy.
How can I implement this logic? Or should I look for another approach that doesn't use ThreadPool? How is this implemented in other software?
You can do it by using a multiprocessing.Queue, passing the number of tasks as the max number of elements in the Queue.
When you put something into a full queue, the putting thread blocks until space becomes available. On the consuming side, you can make your loop like
while True:
    queue.get()  # blocks if queue is empty
and put every element in a new Thread:
class Executor:
    def __init__(self, pool_size: int):
        self.elements = multiprocessing.Queue(pool_size)

    def start(self):
        while True:
            self.refresh_args()
            element = self.elements.get()  # blocks if queue is empty
            # put element in new thread
            # when task is finished, put new element in queue
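A fuller sketch of the goal (a new task starts as soon as a worker frees up) using the standard bounded-queue-plus-persistent-workers pattern rather than the literal new-thread-per-item snippet above; handler/refresh_args correspond to the names in the question, and print/range(20) stand in for them here:

import queue
import threading

class Executor:
    def __init__(self, handler, pool_size: int):
        self.handler = handler
        self.tasks = queue.Queue(maxsize=pool_size)  # bounded: submit() blocks when the backlog is full
        for _ in range(pool_size):
            threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, arg):
        self.tasks.put(arg)             # blocks until a slot frees up

    def join(self):
        self.tasks.join()               # wait for everything submitted so far

    def _worker(self):
        while True:
            arg = self.tasks.get()      # blocks until a task arrives
            try:
                self.handler(arg)       # the worker grabs the next task as soon as this returns
            finally:
                self.tasks.task_done()

# usage sketch: print stands in for the question's handler, range(20) for refresh_args()
ex = Executor(handler=print, pool_size=4)
for task in range(20):
    ex.submit(task)
ex.join()

With this shape, every worker stays busy: the moment one finishes a task it pulls the next one from the queue, and the producing side simply keeps calling submit().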
I'm using the multiprocessing module to split up a very large task. It works for the most part, but I must be missing something obvious with my design, because this way it's very hard for me to effectively tell when all of the data has been processed.
I have two separate tasks that run; one that feeds the other. I guess this is a producer/consumer problem. I use a shared Queue between all processes, where the producers fill up the queue, and the consumers read from the queue and do the processing. The problem is that there is a finite amount of data, so at some point everyone needs to know that all of the data has been processed so the system can shut down gracefully.
It would seem to make sense to use the map_async() function, but since the producers are filling up the queue, I don't know all of the items up front, so I have to go into a while loop and use apply_async() and try to detect when everything is done with some sort of timeout...ugly.
I feel like I'm missing something obvious. How can this be better designed?
PRODUCER
class ProducerProcess(multiprocessing.Process):
    def __init__(self, item, consumer_queue):
        self.item = item
        self.consumer_queue = consumer_queue
        multiprocessing.Process.__init__(self)

    def run(self):
        for record in get_records_for_item(self.item):  # this takes time
            self.consumer_queue.put(record)

def start_producer_processes(producer_queue, consumer_queue, max_running):
    running = []
    while not producer_queue.empty():
        running = [r for r in running if r.is_alive()]
        if len(running) < max_running:
            producer_item = producer_queue.get()
            p = ProducerProcess(producer_item, consumer_queue)
            p.start()
            running.append(p)
        time.sleep(1)
CONSUMER
def process_consumer_chunk(queue, chunksize=10000):
    for i in xrange(0, chunksize):
        try:
            # don't wait too long for an item
            # if new records don't arrive in 10 seconds, process what you have
            # and let the next process pick up more items.
            record = queue.get(True, 10)
        except Queue.Empty:
            break
        do_stuff_with_record(record)
MAIN
if __name__ == "__main__":
    manager = multiprocessing.Manager()
    consumer_queue = manager.Queue(1024*1024)
    producer_queue = manager.Queue()

    producer_items = xrange(0, 10)
    for item in producer_items:
        producer_queue.put(item)

    p = multiprocessing.Process(target=start_producer_processes, args=(producer_queue, consumer_queue, 8))
    p.start()
    consumer_pool = multiprocessing.Pool(processes=16, maxtasksperchild=1)
Here is where it gets cheesy. I can't use map, because the list to consume is being filled up at the same time. So I have to go into a while loop and try to detect a timeout. The consumer_queue can become empty while the producers are still trying to fill it up, so I can't just detect an empty queue and quit on that.
    timed_out = False
    timeout = 1800
    while 1:
        try:
            result = consumer_pool.apply_async(process_consumer_chunk, (consumer_queue, ), dict(chunksize=chunksize,))
            if timed_out:
                timed_out = False
        except Queue.Empty:
            if timed_out:
                break
            timed_out = True
            time.sleep(timeout)
        time.sleep(1)

    consumer_queue.join()
    consumer_pool.close()
    consumer_pool.join()
I thought that maybe I could get() the records in the main thread and pass those into the consumer instead of passing the queue in, but I think I end up with the same problem that way: I still have to run a while loop and use apply_async(). Thank you in advance for any advice!
You could use a manager.Event to signal the end of the work. This event can be shared between all of your processes, and when you signal it from your main process the other workers can gracefully shut down.

while not event.is_set():
    ...rest of code...

So your consumers would wait for the event to be set and handle the cleanup once it is set.
To determine when to set this flag, you can join the producer processes; when those are all complete, set the event and then join the consumer processes.
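A condensed sketch of that shutdown flow (names invented; a trivial consumer body stands in for do_stuff_with_record and the producer processes from the question):

import multiprocessing
from queue import Empty

def consumer(work_queue, stop_event):
    # keep draining until the main process signals completion AND the queue is empty
    while not (stop_event.is_set() and work_queue.empty()):
        try:
            record = work_queue.get(timeout=1)
        except Empty:
            continue                     # nothing yet; re-check the event
        print('processed', record)       # stands in for do_stuff_with_record()

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    work_queue = manager.Queue()
    stop_event = manager.Event()

    workers = [multiprocessing.Process(target=consumer, args=(work_queue, stop_event))
               for _ in range(4)]
    for w in workers:
        w.start()

    for record in range(20):             # stands in for the producer side
        work_queue.put(record)

    stop_event.set()                     # producers are done: tell consumers to finish up
    for w in workers:
        w.join()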
I would like to strongly recommend SimPy instead of multiprocessing/threading for doing discrete event simulation.
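For illustration, a tiny SimPy sketch of the same producer/consumer shape (names and timings invented; not a drop-in replacement for the code above):

import simpy

def producer(env, store):
    for i in range(5):
        yield env.timeout(1)              # producing an item takes 1 time unit
        yield store.put(f'record {i}')

def consumer(env, store):
    while True:
        record = yield store.get()        # waits until an item is available
        yield env.timeout(2)              # processing takes 2 time units
        print(f't={env.now} processed {record}')

env = simpy.Environment()
store = simpy.Store(env, capacity=2)      # bounded buffer between the two
env.process(producer(env, store))
env.process(consumer(env, store))
env.run(until=20)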
I work on a machine learning input pipeline. I wrote a data loader that reads data from a large .hdf file and returns slices, which takes roughly 2 seconds per slice. Therefore I would like to use a queue that takes in objects from several data loaders and can return single objects from the queue via a next function (like a generator). Furthermore, the processes that fill the queue should somehow run in the background, refilling the queue whenever it is not full. I cannot get it to work properly. It did work with a single data loader, but it gave me the same slices 4 times.
import multiprocessing as mp

class Queue_Generator():
    def __init__(self, data_loader_list):
        self.pool = mp.Pool(4)
        self.data_loader_list = data_loader_list
        self.queue = mp.Queue(maxsize=16)
        self.pool.map(self.fill_queue, self.data_loader_list)

    def fill_queue(self, gen):
        self.queue.put(next(gen))

    def __next__(self):
        yield self.queue.get()
What I get from this:
NotImplementedError: pool objects cannot be passed between processes or pickled
Thanks in advance
Your specific error means that you cannot have a pool as an attribute of your class when you pass class methods to a pool (the pool itself cannot be pickled). What I would suggest is something like the following:
import multiprocessing as mp
from queue import Empty

class QueueGenerator(object):
    def __init__(self, data_loader_list):
        self.data_loader_list = data_loader_list
        self.queue = mp.Queue(maxsize=16)

    def __iter__(self):
        processes = list()
        for _ in range(4):
            pr = mp.Process(target=fill_queue, args=(self.queue, self.data_loader_list))
            pr.start()
            processes.append(pr)
        return self

    def __next__(self):
        try:
            # the timeout should have a value, otherwise your loop will never stop.
            # Make it long enough that your processes have time to update the queue,
            # but not so long that your program freezes for an extended period of
            # time after all information is processed.
            return self.queue.get(timeout=1)
        except Empty:
            raise StopIteration

# have fill_queue as a separate function
def fill_queue(queue, gen):
    while True:
        try:
            value = next(gen)
            queue.put(value)
        except StopIteration:  # assumes the given data_loader_list is an iterator
            break
    print('stopping')

gen = iter(range(70))
qg = QueueGenerator(gen)

for val in qg:
    print(val)

# test if it works several times:
for val in qg:
    print(val)
The next issue for you to solve, I think, is to have data_loader_list be something that provides new information in every separate process. But since you have not given any information about that, I can't help you there. The above does, however, give you a way to have the processes fill your queue, which is then handed out through an iterator.
Not quite sure why you are yielding in __next__; that doesn't look right to me. __next__ should return a value, not a generator object.
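A tiny illustration of that last point: a def that contains yield becomes a generator function, so calling it hands back a generator object rather than a value.

class Demo:
    def __next__(self):
        yield 42          # the yield turns __next__ into a generator function

d = Demo()
print(next(d))            # prints something like <generator object ...>, not 42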
Here is a simple way that you can return the results of parallel functions as a generator. It may or may not meet your specific requirements but can be tweaked to suit. It will keep on processing data_loader_list until it is exhausted. This may use a lot of memory compared to keeping, for example, 4 items in a Queue at all times.
import multiprocessing as mp

def read_lines(data_loader):
    from time import sleep
    sleep(2)
    return f'did something with {data_loader}'

def make_gen(data_loader_list):
    with mp.Pool(4) as pool:
        for result in pool.imap(read_lines, data_loader_list):
            yield result

if __name__ == '__main__':
    data_loader_list = [i for i in range(15)]
    result_generator = make_gen(data_loader_list)
    print(type(result_generator))
    for i in result_generator:
        print(i)
Using imap means that the results can be processed as they are produced. map and map_async would block in the for loop until all results were ready. See this question for more.
I'm working with a multi-threaded script, where a Controller thread puts a varying number of items in a class queue shared by multiple other Worker threads. I'm looking for a way to have the Controller class wait until all tasks are completed by the other threads. I have something similar to the below:
Worker class
class Worker(threading.Thread):
    q = queue.Queue()
    evt_stop = threading.Event()

    def task(self, *data):
        result = data[1] + data[1]
        data[0].q.put(result)

    def __init__(self):
        ...

    def run(self):
        while not Worker.evt_stop.is_set():
            if not Worker.q.empty():
                data = Worker.q.get()
                self.task(data[0], data[1])
                Worker.q.task_done()
Controller class
class Controller(threading.Thread):
    evt_stop = threading.Event()

    def qsize(self, n):
        if self.q.qsize() != n:
            return False
        else:
            return True

    def __init__(self):
        self.q = queue.Queue()
        self.await = threading.Condition()

    def run(self):
        r = [<list of unknown length>]
        for i in r:
            Worker.q.put((self, i))
        with self.await:
            if self.await.wait_for(lambda: self.qsize(len(r)), timeout=5.0):
                while not self.q.empty():
                    x = self.q.get()
                    print(x)
                    self.q.task_done()
But from my understanding of the answer and clarification provided here, the lambda will always return True, as it is the function object that is returned, not necessarily the value ("self.test(1) is the object that is the result of calling the method, which is a bool, not a callable"). Am I understanding this correctly? Additionally, am I greatly over-complicating this, and a simpler solution exists?
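For what it's worth, a small illustration of how Condition.wait_for treats that lambda (names invented): the predicate is called repeatedly and its boolean return value is what gets checked, so the lambda does not "always return True".

import threading

cond = threading.Condition()
state = {'ready': False}

with cond:
    # wait_for calls the predicate repeatedly until it returns a truthy value
    # or the timeout expires; nothing ever sets state['ready'] here, so it
    # gives up after 0.5 s and returns False.
    satisfied = cond.wait_for(lambda: state['ready'], timeout=0.5)
    print(satisfied)   # False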
Clarification of intent:
This is essentially a script that performs multiple, network oriented functions. The main thread runs a menu system from which the user can select different network I/O tasks. Each of these tasks runs in a separate thread, instantiated by instances of the controller class. When the script first runs, it instantiates a set of worker threads, typically 4 or 6. The idea is that the user can select one of the instantiated controller class objects, and it will set about fulfilling its logic, which involves sending/receiving data through the worker threads, then modifying that data, and repeating the process. The issue I'm trying to tackle is having the controller object wait until the worker threads complete all the tasks that specific controller instance put into the shared worker class queue. It is unknown in advance how many tasks that will be, as it is dependent upon the results of the very first network I/O task.
I'm curious if there is a way to lock a multiprocessing.Queue object manually.
I have a pretty standard Producer/Consumer pattern set up in which my main thread is constantly producing a series of values, and a pool of multiprocessing.Process workers is acting on the values produced.
It is all controlled via a sole multiprocessing.Queue().
import time
import multiprocessing

class Reader(multiprocessing.Process):
    def __init__(self, queue):
        multiprocessing.Process.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            item = self.queue.get()
            if isinstance(item, str):
                break

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    reader = Reader(queue)
    reader.start()

    start_time = time.time()
    while time.time() - start_time < 10:
        queue.put(1)
    queue.put('bla bla bla sentinal')
    queue.join()
The issue I'm running into is that my worker pool cannot consume and process the queue as fast as the main thread inserts values into it. So after some period of time, the Queue grows so unwieldy that it raises a MemoryError.
An obvious solution would be to simply add a wait check in the producer to stall it from putting any more values into the queue. Something along the lines of:
while time.time() - start_time < 10:
    queue.put(1)
    while queue.qsize() > some_size:
        time.sleep(.1)

queue.put('bla bla bla sentinal')
queue.join()
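(As an aside, a bounded queue gives the same back-pressure without the polling loop; a minimal sketch, with some_size standing in for whatever limit fits:)

queue = multiprocessing.Queue(maxsize=some_size)  # bounded at some_size items

while time.time() - start_time < 10:
    queue.put(1)  # blocks automatically whenever the queue already holds some_size items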
However, because of the funky nature of the program, I'd like to dump everything in the Queue to a file for later processing. But! Without being able to temporarily lock the queue, the worker can't consume everything in it as the producer is constantly filling it back up with junk -- conceptually anyway. After numerous tests it seems that at some point one of the locks wins (but usually the one adding to the queue).
Edit: Also, I realize it'd be possible to simply stop the producer and consume it from that thread... but that makes the Single Responsibility guy in me feel sad, as the producer is a Producer, not a Consumer.
Edit:
After looking through the source of Queue, I came up with this:
def dump_queue(q):
    q._rlock.acquire()
    try:
        res = []
        while not q.empty():
            res.append(q._recv())
            q._sem.release()
        return res
    finally:
        q._rlock.release()
However, I'm too scared to use it! I have no idea if this is "correct" or not. I don't have a firm enough grasp to know if this'll hold up without blowing up any of Queues internals.
Anyone know if this'll break? :)
Given what was said in the comments, a Queue is simply a wrong data structure for your problem - but is likely part of a usable solution.
It sounds like you have only one Producer. Create a new, Producer-local (not shared across processes) class implementing the semantics you really need. For example,
class FlushingQueue:
    def __init__(self, mpqueue, path_to_spill_file, maxsize=1000, dumpsize=1000000):
        from collections import deque
        self.q = mpqueue  # a shared `multiprocessing.Queue`
        self.dump_path = path_to_spill_file
        self.maxsize = maxsize
        self.dumpsize = dumpsize
        self.d = deque()  # buffer for overflowing values

    def put(self, item):
        if self.q.qsize() < self.maxsize:
            self.q.put(item)
            # in case consumers have made real progress
            while self.d and self.q.qsize() < self.maxsize:
                self.q.put(self.d.popleft())
        else:
            self.d.append(item)
            if len(self.d) >= self.dumpsize:
                self.dump()

    def dump(self):
        # code to flush self.d to the spill file; no
        # need to look at self.q at all
I bet you can make this work :-)
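If it helps, one possible way to fill in dump() for the class above, assuming an append-mode spill file and pickle as the serialization (both are my choices, not part of the original answer):

    def dump(self):
        # flush the overflow buffer to the spill file and clear it;
        # self.q is not touched at all
        import pickle
        with open(self.dump_path, 'ab') as f:
            while self.d:
                pickle.dump(self.d.popleft(), f)

The spill file can later be replayed with repeated pickle.load() calls on the same open file.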