I am working on a threaded application where one thread will feed a Queue with objects to be modified and a number of other threads will then read from the queue, do the modifications and save the changes.
The application won't need a lot of concurrency, so I would like to stick to an SQLite database. Here is a small example illustrating the application:
import queue
import threading

import peewee as pw

db = pw.SqliteDatabase('test.db', threadlocals=True)


class Container(pw.Model):
    contents = pw.CharField(default="spam")

    class Meta:
        database = db


class FeederThread(threading.Thread):
    def __init__(self, input_queue):
        super().__init__()
        self.q = input_queue

    def run(self):
        containers = Container.select()
        for container in containers:
            self.q.put(container)


class ReaderThread(threading.Thread):
    def __init__(self, input_queue):
        super().__init__()
        self.q = input_queue

    def run(self):
        while True:
            item = self.q.get()
            with db.execution_context() as ctx:
                # Get a new connection to the container object:
                container = Container.get(id=item.id)
                container.contents = "eggs"
                container.save()
            self.q.task_done()


if __name__ == "__main__":
    db.connect()
    try:
        db.create_tables([Container, ])
    except pw.OperationalError:
        pass
    else:
        [Container.create() for c in range(42)]
    db.close()

    q = queue.Queue(maxsize=10)

    feeder = FeederThread(q)
    feeder.setDaemon(True)
    feeder.start()

    for i in range(10):
        reader = ReaderThread(q)
        reader.setDaemon(True)
        reader.start()

    q.join()
Based on the peewee docs, multi-threading should be supported for SQLite. However, I keep getting the infamous peewee.OperationalError: database is locked error, with the traceback pointing to the container.save() line.
How do I get around this?
I was kind of surprised to see this failing as well, so I copied your code and played around with some different ideas. I think the problem is that ExecutionContext() will, by default, cause the wrapped block to run in a transaction. To avoid this, I passed in False in the reader threads.
I also edited the feeder to consume the SELECT statement before putting stuff into the queue (list(Container.select())).
The following works for me locally:
class FeederThread(threading.Thread):
    def __init__(self, input_queue):
        super(FeederThread, self).__init__()
        self.q = input_queue

    def run(self):
        containers = list(Container.select())
        for container in containers:
            self.q.put(container.id)  # I don't like passing model instances around like this, personal preference though


class ReaderThread(threading.Thread):
    def __init__(self, input_queue):
        super(ReaderThread, self).__init__()
        self.q = input_queue

    def run(self):
        while True:
            item = self.q.get()
            with db.execution_context(False):
                # Get a new connection to the container object:
                container = Container.get(id=item)
                container.contents = "nuggets"
                with db.atomic():
                    container.save()
            self.q.task_done()


if __name__ == "__main__":
    with db.execution_context():
        try:
            db.create_tables([Container, ])
        except pw.OperationalError:
            pass
        else:
            [Container.create() for c in range(42)]

    # ... same ...
# ... same ...
I'm not wholly satisfied with this, but hopefully it gives you some ideas.
Here's a blog post I wrote a while back that has some tips for getting higher concurrency with SQLite: http://charlesleifer.com/blog/sqlite-small-fast-reliable-choose-any-three-/
Have you tried WAL mode?
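If it helps, here is a minimal sketch of turning WAL mode on for the test.db file from the question (this assumes the same peewee setup as above; the journal mode is persistent for the database file, so issuing the PRAGMA once is enough, and newer peewee releases also let you pass pragmas to SqliteDatabase directly):

import peewee as pw

db = pw.SqliteDatabase('test.db', threadlocals=True)
db.connect()
# Switch the database file to write-ahead logging so readers and a single
# writer can proceed concurrently; this usually makes "database is locked"
# errors much rarer.
db.execute_sql('PRAGMA journal_mode=WAL;')
db.close()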
Improve INSERT-per-second performance of SQLite?
You have to be quite careful if you have concurrent access to SQLite, as the whole database is locked when writes are done, and although multiple readers are possible, writes will be locked out. This has been improved somewhat with the addition of a WAL in newer SQLite versions.
and
If you are using multiple threads, you can try using the shared page cache, which will allow loaded pages to be shared between threads, which can avoid expensive I/O calls.
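For completeness, here is a small sketch of opening SQLite with a shared page cache from the standard library sqlite3 module (the file name is just the test.db used earlier; whether the shared cache actually helps depends on the workload):

import sqlite3

# URI filenames (uri=True) let us request the shared page cache so that
# connections in the same process reuse loaded pages instead of re-reading
# them from disk.
conn = sqlite3.connect('file:test.db?cache=shared', uri=True)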
So I have been struggling with this one pickle error, which is driving me crazy. I have the following masterEngine class:
import eventlet
import socketio
import multiprocessing
from multiprocessing import Queue

from multi import SIOSerever


class masterEngine:
    if __name__ == '__main__':
        serverObj = SIOSerever()

        try:
            receiveData = multiprocessing.Process(target=serverObj.run)
            receiveData.start()

            receiveProcess = multiprocessing.Process(target=serverObj.fetchFromQueue)
            receiveProcess.start()

            receiveData.join()
            receiveProcess.join()
        except Exception as error:
            print(error)
and I have another file called multi, which looks like the following:
import multiprocessing
from multiprocessing import Queue
import eventlet
import socketio


class SIOSerever:
    def __init__(self):
        self.cycletimeQueue = Queue()
        self.sio = socketio.Server(cors_allowed_origins='*', logger=False)
        self.app = socketio.WSGIApp(self.sio, static_files={'/': 'index.html', })
        self.ws_server = eventlet.listen(('0.0.0.0', 5000))

        @self.sio.on('production')
        def p_message(sid, message):
            self.cycletimeQueue.put(message)
            print("I logged : " + str(message))

    def run(self):
        eventlet.wsgi.server(self.ws_server, self.app)

    def fetchFromQueue(self):
        while True:
            cycle = self.cycletimeQueue.get()
            print(cycle)
As you can see, I am trying to create two processes for run and fetchFromQueue, which I want to run independently.
My run function starts the python-socketio server, to which I'm sending some data from an HTML web page (this runs perfectly without multiprocessing). I am then trying to push the received data into a Queue so that my other function can retrieve it and work with it.
I have a set of time-consuming operations that I need to carry out on the data received from the socket, which is why I'm pushing it all into a Queue.
On running the masterEngine class I receive the following:
Can't pickle <class 'threading.Thread'>: it's not the same object as threading.Thread
I ended!
[Finished in 0.5s]
Can you please help with what I am doing wrong?
From multiprocessing programming guidelines:
Explicitly pass resources to child processes
On Unix using the fork start method, a child process can make use of a shared resource created in a parent process using a global resource. However, it is better to pass the object as an argument to the constructor for the child process.
Apart from making the code (potentially) compatible with Windows and the other start methods this also ensures that as long as the child process is still alive the object will not be garbage collected in the parent process. This might be important if some resource is freed when the object is garbage collected in the parent process.
Therefore, I slightly modified your example by removing everything unnecessary, but showing an approach where the shared queue is explicitly passed to all processes that use it:
import multiprocessing

MAX = 5


class SIOSerever:
    def __init__(self, queue):
        self.cycletimeQueue = queue

    def run(self):
        for i in range(MAX):
            self.cycletimeQueue.put(i)

    @staticmethod
    def fetchFromQueue(cycletimeQueue):
        while True:
            cycle = cycletimeQueue.get()
            print(cycle)
            if cycle >= MAX - 1:
                break


def start_server(queue):
    server = SIOSerever(queue)
    server.run()


if __name__ == '__main__':
    try:
        queue = multiprocessing.Queue()

        receiveData = multiprocessing.Process(target=start_server, args=(queue,))
        receiveData.start()

        receiveProcess = multiprocessing.Process(target=SIOSerever.fetchFromQueue, args=(queue,))
        receiveProcess.start()

        receiveData.join()
        receiveProcess.join()
    except Exception as error:
        print(error)
0
1
...
I've two classes - MessageProducer and MessageConsumer.
MessageConsumer does the following:
receives messages and puts them in its message list "_unprocessed_msgs"
on a separate worker thread, moves the messages to internal list "_in_process_msgs"
on the worker thread, processes messages from "_in_process_msgs"
In my development environment, I'm facing an issue with #2 above: after adding a message by performing step #1, when the worker thread checks the length of "_unprocessed_msgs", it sees zero.
When step #1 is repeated, the list properly shows 2 items on the thread on which the item was added. But in step #2, on the worker thread, len(_unprocessed_msgs) again returns zero.
Not sure why this is happening. I would really appreciate any help on this.
I'm using Ubuntu 16.04 having Python 2.7.12.
Below is the sample source code. Please let me know if more information is required.
import logging
import threading
import time

# Module-level logger used by the consumer below.
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
LOG = logging.getLogger(__name__)


class MessageConsumerThread(threading.Thread):
    def __init__(self):
        super(MessageConsumerThread, self).__init__()
        self._unprocessed_msg_q = []
        self._in_process_msg_q = []
        self._lock = threading.Lock()
        self._stop_processing = False

    def start_msg_processing_thread(self):
        self._stop_processing = False
        self.start()

    def stop_msg_processing_thread(self):
        self._stop_processing = True

    def receive_msg(self, msg):
        with self._lock:
            LOG.info("Before: MessageConsumerThread::receive_msg: "
                     "len(self._unprocessed_msg_q)=%s" %
                     len(self._unprocessed_msg_q))
            self._unprocessed_msg_q.append(msg)
            LOG.info("After: MessageConsumerThread::receive_msg: "
                     "len(self._unprocessed_msg_q)=%s" %
                     len(self._unprocessed_msg_q))

    def _queue_unprocessed_msgs(self):
        with self._lock:
            LOG.info("MessageConsumerThread::_queue_unprocessed_msgs: "
                     "len(self._unprocessed_msg_q)=%s" %
                     len(self._unprocessed_msg_q))
            if self._unprocessed_msg_q:
                LOG.info("Moving messages from unprocessed to in_process queue")
                self._in_process_msg_q += self._unprocessed_msg_q
                self._unprocessed_msg_q = []
                LOG.info("Moved messages from unprocessed to in_process queue")

    def run(self):
        while not self._stop_processing:
            # Allow other threads to add messages to message queue
            time.sleep(1)
            # Move unprocessed listeners to in-process listener queue
            self._queue_unprocessed_msgs()
            # If nothing to process continue the loop
            if not self._in_process_msg_q:
                continue
            for msg in self._in_process_msg_q:
                self.consume_message(msg)
            # Clean up processed messages
            del self._in_process_msg_q[:]

    def consume_message(self, msg):
        print(msg)


class MessageProducerThread(threading.Thread):
    def __init__(self, producer_id, msg_receiver):
        super(MessageProducerThread, self).__init__()
        self._producer_id = producer_id
        self._msg_receiver = msg_receiver

    def start_producing_msgs(self):
        self.start()

    def run(self):
        for i in range(1, 10):
            msg = "From: %s; Message:%s" % (self._producer_id, i)
            self._msg_receiver.receive_msg(msg)


def main():
    msg_receiver_thread = MessageConsumerThread()
    msg_receiver_thread.start_msg_processing_thread()

    msg_producer_thread = MessageProducerThread(producer_id='Producer-01',
                                                msg_receiver=msg_receiver_thread)
    msg_producer_thread.start_producing_msgs()

    msg_producer_thread.join()
    msg_receiver_thread.stop_msg_processing_thread()
    msg_receiver_thread.join()


if __name__ == '__main__':
    main()
Following is the log that I get:
INFO: MessageConsumerThread::_queue_unprocessed_msgs: len(self._unprocessed_msg_q)=0
INFO: Before: MessageConsumerThread::receive_msg: len(self._unprocessed_msg_q)=0
INFO: After: MessageConsumerThread::receive_msg: **len(self._unprocessed_msg_q)=1**
INFO: MessageConsumerThread::_queue_unprocessed_msgs: **len(self._unprocessed_msg_q)=0**
INFO: MessageConsumerThread::_queue_unprocessed_msgs: len(self._unprocessed_msg_q)=0
INFO: Before: MessageConsumerThread::receive_msg: len(self._unprocessed_msg_q)=1
INFO: After: MessageConsumerThread::receive_msg: **len(self._unprocessed_msg_q)=2**
INFO: MessageConsumerThread::_queue_unprocessed_msgs: **len(self._unprocessed_msg_q)=0**
This is not a good design for your application.
I spent some time trying to debug this, but threading code is naturally complicated, so we should try to simplify it instead of making it even more confusing.
When I see threading code in Python, I usually see it written in a procedural form: a normal function that is passed to threading.Thread as the target argument and that drives each thread. That way, you don't need to write code for a new class that will have a single instance.
Another thing is that, although Python's global interpreter lock itself guarantees lists won't get corrupted if modified in two separate threads, lists are not a recommended "thread data passing" data structure. You probably should look at queue.Queue to do that.
The thing that is wrong in this code at first sight is probably not the cause of your problem, due to your use of locks, but it might be. Instead of
self._unprocessed_msg_q = []
which creates a new list object that the other thread momentarily has no reference to (so it might keep writing data to the old list), you should do:
self._unprocessed_msg_q[:] = []
Or just use the del slice approach you use in the other method.
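To make the difference concrete, here is a tiny stand-alone illustration of rebinding versus clearing in place (plain CPython, no threads needed to see the effect):

a = [1, 2, 3]
b = a          # b refers to the same list object as a
a = []         # rebinds the name a to a *new* list; b still sees the old one
print(b)       # [1, 2, 3]

c = [1, 2, 3]
d = c
c[:] = []      # clears the shared object in place
print(d)       # []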
But to be on the safer side, and to have more maintainable and less surprising code, you really should change to a procedural approach there, sticking with Python threading: treat the Thread as the "final" object that does its thing, and use Queues to pass data around:
# coding: utf-8

from __future__ import print_function
from __future__ import unicode_literals

from threading import Thread
try:
    from queue import Queue, Empty
except ImportError:
    from Queue import Queue, Empty

import time
import random

TERMINATE_SENTINEL = object()
NO_DATA_SENTINEL = object()


class Receiver(object):
    def __init__(self, queue):
        self.queue = queue
        self.in_process = []

    def receive_data(self, data):
        self.in_process.append(data)

    def consume_data(self):
        print("received data:", self.in_process)
        del self.in_process[:]

    def receiver_loop(self):
        queue = self.queue
        while True:
            try:
                data = queue.get(block=False)
            except Empty:
                print("got no data from queue")
                data = NO_DATA_SENTINEL

            if data is TERMINATE_SENTINEL:
                print("Got sentinel: exiting receiver loop")
                break

            self.receive_data(data)
            time.sleep(random.uniform(0, 0.3))

            if queue.empty():
                # Only process data if we have nothing to receive right now:
                self.consume_data()
                print("sleeping receiver")
                time.sleep(1)

        if self.in_process:
            self.consume_data()


def producer_loop(queue):
    for i in range(10):
        time.sleep(random.uniform(0.05, 0.4))
        print("putting {0} in queue".format(i))
        queue.put(i)


def main():
    msg_queue = Queue()

    msg_receiver_thread = Thread(target=Receiver(msg_queue).receiver_loop)
    time.sleep(0.1)
    msg_producer_thread = Thread(target=producer_loop, args=(msg_queue,))

    msg_receiver_thread.start()
    msg_producer_thread.start()

    msg_producer_thread.join()
    msg_queue.put(TERMINATE_SENTINEL)
    msg_receiver_thread.join()


if __name__ == '__main__':
    main()
Note that since you want multiple methods in the receiver thread to do things with the data, I used a class - but it does not inherit from Thread, and does not have to worry about its workings. All its methods are called within the same thread: no need for locks, no worries about race conditions within the receiver class itself. For communicating outside the class, the Queue class is structured to handle any race conditions for us.
The producer loop, as it is just a dummy producer, has no need at all to be written in class form. But it would look just the same if it had more methods.
(The random sleeps help visualize what would happen in "real world" message receiving)
Also, you might want to take a look at something like:
https://www.thoughtworks.com/insights/blog/composition-vs-inheritance-how-choose
Finally, I was able to solve the issue. In the actual code, I have a Manager class that is responsible for instantiating MessageConsumerThread as the last thing in its initializer:
class Manager(object):
    def __init__(self):
        ...
        ...
        self._consumer = MessageConsumerThread(self)
        self._consumer.start_msg_processing_thread()
The problem seems to be with passing 'self' to the MessageConsumerThread initializer while Manager is still executing its own initializer (even though those are the last two steps). The moment I moved the creation of the consumer out of the initializer, the consumer thread was able to see the elements in "_unprocessed_msg_q".
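In case it helps someone else, a sketch of what the fix looks like (the Manager internals here are assumptions; only the two consumer lines come from the snippet above):

class Manager(object):
    def __init__(self):
        # ... other initialisation only; no threads are started here ...
        self._consumer = None

    def start(self):
        # The consumer is created and started only after __init__ has
        # finished, so the thread never observes a half-constructed Manager.
        self._consumer = MessageConsumerThread(self)
        self._consumer.start_msg_processing_thread()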
Please note that the issue is still not reproducible with the above sample code; it manifests itself only in the production environment. Without the above fix, I tried a queue and a dictionary as well but observed the same issue. After the fix, I tried with a queue and a list and was able to execute the code successfully.
I really appreciate and thank @jsbueno and @ivan_pozdeev for their time and help! The Stack Overflow community is very helpful!
A user visits http://example.com/url/ and invokes page_parser from views.py. page_parser creates an instance of the class Foo from script.py.
Each time http://example.com/url/ is visited, I see that memory usage goes up and up. I guess the garbage collector doesn't collect the instantiated Foo objects. Any ideas why that is?
Here is the code:
views.py:
from django.http import HttpResponse
from script import Foo
from script import urls


# When user visits http://example.com/url/ I run `page_parser`
def page_parser(request):
    Foo(urls)
    return HttpResponse("alldone")
script.py:
import requests
from queue import Queue
from threading import Thread


class Newthread(Thread):
    def __init__(self, queue, result):
        Thread.__init__(self)
        self.queue = queue
        self.result = result

    def run(self):
        while True:
            url = self.queue.get()
            data = requests.get(url)  # Download image at url
            self.result.append(data)
            self.queue.task_done()


class Foo:
    def __init__(self, urls):
        self.result = list()
        self.queue = Queue()
        self.startthreads()
        for url in urls:
            self.queue.put(url)
        self.queue.join()

    def startthreads(self):
        for x in range(3):
            worker = Newthread(queue=self.queue, result=self.result)
            worker.daemon = True
            worker.start()


urls = [
    "https://static.pexels.com/photos/106399/pexels-photo-106399.jpeg",
    "https://static.pexels.com/photos/164516/pexels-photo-164516.jpeg",
    "https://static.pexels.com/photos/206172/pexels-photo-206172.jpeg",
    "https://static.pexels.com/photos/32870/pexels-photo.jpg",
    "https://static.pexels.com/photos/106399/pexels-photo-106399.jpeg",
    "https://static.pexels.com/photos/164516/pexels-photo-164516.jpeg",
    "https://static.pexels.com/photos/206172/pexels-photo-206172.jpeg",
    "https://static.pexels.com/photos/32870/pexels-photo.jpg",
    "https://static.pexels.com/photos/32870/pexels-photo.jpg",
    "https://static.pexels.com/photos/106399/pexels-photo-106399.jpeg",
    "https://static.pexels.com/photos/164516/pexels-photo-164516.jpeg",
    "https://static.pexels.com/photos/206172/pexels-photo-206172.jpeg",
    "https://static.pexels.com/photos/32870/pexels-photo.jpg"]
There are several moving parts involved, but what I think happens is the following:
WSGI processes are not killed after each request, so things may persist.
You create 3 new threads, but don't let them join the main thread again, for example when the queue is empty.
Since the reference count of Foo.queue never reaches zero (as the threads are still alive, waiting for new queue items), it cannot be garbage collected.
So you keep creating new threads and new Foo instances, and none of them can be freed.
I'm not an expert on queue.Queue, but my theory can be verified if you watch the number of threads in the WSGI process go up by 3 with each request (for example using top(1)).
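For example, one way to check this from inside the process (a hedged sketch; the logging setup is an assumption, only the view itself comes from the question) is to log threading.active_count() around the call:

import logging
import threading

from django.http import HttpResponse
from script import Foo, urls

logger = logging.getLogger(__name__)


def page_parser(request):
    logger.info("live threads before: %d", threading.active_count())
    Foo(urls)
    # If the worker threads never exit, this number grows by 3 per request.
    logger.info("live threads after: %d", threading.active_count())
    return HttpResponse("alldone")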
As a side note, this is a side effect of your class design. You do everything in __init__, which should really only assign attributes.
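One possible way out, sketched below rather than taken from your code, is to give each worker a sentinel so the threads exit when the work is done; once no thread holds a reference to the queue or the result list, everything can be collected:

import requests
from queue import Queue
from threading import Thread


def download_all(urls, workers=3):
    queue = Queue()
    results = []

    def worker():
        while True:
            url = queue.get()
            if url is None:              # sentinel: time to exit this thread
                queue.task_done()
                break
            results.append(requests.get(url))
            queue.task_done()

    threads = [Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()

    for url in urls:
        queue.put(url)
    for _ in threads:                    # one sentinel per worker
        queue.put(None)

    queue.join()
    for t in threads:
        t.join()                         # no lingering threads keep objects alive
    return results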
I'm trying to update a row in the database asynchronously using the multiprocessing module. My code has a simple function create_member that inserts some data into a table and then creates a process that may change this data. The problem is that the session passed to async_create_member is closing the database connection, and on the next request I get psycopg's error:
(Interface Error) connection already closed
Here's the code:
# Imports assumed elsewhere in the module:
from multiprocessing import Process
from sqlalchemy import update


def create_member(self, data):
    member = self.entity(**data)
    self.session.add(member)
    for name in data:
        setattr(member, name, data[name])
    self.session.commit()
    self.session.close()
    if self.index.is_indexable:
        Process(target=self.async_create_member,
                args=(data, self.session)).start()
    return member


def async_create_member(self, data, session):
    ok, data = self.index.create(data)
    if ok:
        datacopy = data.copy()
        data.clear()
        data['document'] = datacopy['document']
        data['dt_idx'] = datacopy['dt_idx']
        stmt = update(self.entity.__table__).where(
            self.entity.__table__.c.id_doc == datacopy['id_doc'])\
            .values(**data)
        session.begin()
        session.execute(stmt)
        session.commit()
        session.close()
I could possibly solve this by creating a new connection in async_create_member, but this was leaving too many idle transactions on Postgres:
engine = create_new_engine()
conn = engine.connect()
conn.execute(stmt)
conn.close()
What should I do now? Is there a way to fix the first code? Or should I keep creating new connections with the create_new_engine function? Should I use threads or processes?
You can't reuse sessions across threads or processes. Sessions aren't thread safe, and the connectivity that underlies a Session isn't inherited cleanly across processes. The error message you are getting is accurate, if uninformative: the DB connection is indeed closed if you try to use it after inheriting it across a process boundary.
In most cases, yes, you should create a session for each process in a multiprocessing setting.
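As a minimal sketch of that (the connection URL and function signature here are placeholders, not from your code), the child process builds its own engine and session:

from multiprocessing import Process

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker


def async_create_member(data):
    # Everything connection-related is created inside the child process,
    # so nothing is inherited from the parent across the fork.
    engine = create_engine('postgresql://user:password@localhost/dbname')
    Session = sessionmaker(bind=engine)
    session = Session()
    try:
        # ... build and execute the UPDATE statement as in the question ...
        session.commit()
    finally:
        session.close()
        engine.dispose()   # return all pooled connections


# Usage from the parent process:
# Process(target=async_create_member, args=(data,)).start()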
If your problem meets the following conditions:
you are doing a lot of CPU-intensive processing for each object
database writes are relatively lightweight in comparison
you want to use a lot of processes (I do this on 8+ core machines)
It might be worth your while to create a single writer process that owns a session, and pass the objects to that process. Here's how it usually works for me (Note: not meant to be runnable code):
import multiprocessing

from your_database_layer import create_new_session, WhateverType

work = multiprocessing.JoinableQueue()


def writer(commit_every=50):
    global work
    session = create_new_session()
    counter = 0

    while True:
        item = work.get()
        if item is None:
            break
        session.add(item)
        counter += 1
        if counter % commit_every == 0:
            session.commit()
        work.task_done()

    # Last DB writes
    session.commit()

    # Mark the final None in the queue as complete
    work.task_done()
    return


def very_expensive_object_creation(data):
    global work
    very_expensive_object = WhateverType(**data)
    # Perform lots of computation
    work.put(very_expensive_object)
    return


def main():
    writer_process = multiprocessing.Process(target=writer)
    writer_process.start()

    # Create your pool that will feed the queue here, i.e.
    workers = multiprocessing.Pool()

    # Dispatch lots of work to very_expensive_object_creation in parallel here
    workers.map(very_expensive_object_creation, some_iterable_source_here)
    # --or-- in whatever other way floats your boat, such as
    workers.apply_async(very_expensive_object_creation, args=(some_data_1,))
    workers.apply_async(very_expensive_object_creation, args=(some_data_2,))
    # etc.

    # Signal that we won't dispatch any more work
    workers.close()

    # Wait for the creation work to be done
    workers.join()

    # Trigger the exit condition for the writer
    work.put(None)

    # Wait for the queue to be emptied
    work.join()

    return
I am having a synchronization problem while threading with CPython. I have two files; I parse them and return the desired result. However, the code below acts strangely: it returns three times instead of two, and it doesn't return results in the order I put them into the queue. Here's the code:
import Queue
import threading

from HtmlDoc import Document

OUT_LIST = []


class Threader(threading.Thread):
    """
    Start threading
    """
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            if self.queue.qsize() == 0:
                break
            path, host = self.queue.get()
            f = open(path, "r")
            source = f.read()
            f.close()
            self.out_queue.put((source, host))
            self.queue.task_done()


class Processor(threading.Thread):
    """
    Process threading
    """
    def __init__(self, out_queue):
        self.out_queue = out_queue
        self.l_first = []
        self.f_append = self.l_first.append
        self.l_second = []
        self.s_append = self.l_second.append
        threading.Thread.__init__(self)

    def first(self, doc):
        pass  # some code to retrieve the text desired, this works 100% I tested it manually

    def second(self, doc):
        pass  # some code to retrieve the text desired, this works 100% I tested it manually

    def run(self):
        while True:
            if self.out_queue.qsize() == 0:
                break
            doc, host = self.out_queue.get()
            if host == "first":
                self.first(doc)
            elif host == "second":
                self.second(doc)
            OUT_LIST.extend(self.l_first + self.l_second)
            self.out_queue.task_done()


def main():
    queue = Queue.Queue()
    out_queue = Queue.Queue()

    queue.put(("...first.html", "first"))
    queue.put(("...second.html", "second"))
    qsize = queue.qsize()

    for i in range(qsize):
        t = Threader(queue, out_queue)
        t.setDaemon(True)
        t.start()

    for i in range(qsize):
        dt = Processor(out_queue)
        dt.setDaemon(True)
        dt.start()

    queue.join()
    out_queue.join()

    print '<br />'.join(OUT_LIST)


main()
Now, when I print, I'd like to print the content of "first" first of all and then the content of "second". Can anyone help me?
NOTE: I am threading because actually I will have to connect more than 10 places at a time and retrieve its results. I believe that threading is the most appropriate way to accomplish such a task
I am threading because actually I will have to connect more than 10 places at a time and retrieve its results. I believe that threading is the most appropriate way to accomplish such a task
Threading is actually one of the most error-prone ways to manage multiple concurrent connections. A more powerful, more debuggable approach is to use event-driven asynchronous networking, such as implemented by Twisted. If you're interested in using this model, you might want to check out this introduction.
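Not Twisted, but as a rough sketch of the same event-driven idea in modern Python 3 using the standard library asyncio module (the fetch coroutine below is a stand-in for real network I/O, not part of your code):

import asyncio


async def fetch(host):
    await asyncio.sleep(0.1)            # pretend to talk to `host`
    return "response from %s" % host


async def main():
    hosts = ["first", "second"] + ["host-%d" % i for i in range(10)]
    # gather() returns results in the order the coroutines were passed in,
    # even though the underlying I/O overlaps.
    results = await asyncio.gather(*(fetch(h) for h in hosts))
    for result in results:
        print(result)


asyncio.run(main())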
I don't share the opinion that threading is the best way to do this (IMO some event/select mechanism would be better), but the problem with your code could be in the variables t and dt. You make the assignments inside the loop and the object instances are not stored anywhere, so it may be that your new Thread/Processor instance gets deleted at the end of each cycle.
It would be clearer if you showed us the precise output of this code.
1) You cannot control the order of job completion. It depends on execution time, so to return results in the order you want, you can create a global dictionary keyed by job name, like job_results = {'first': None, 'second': None}, store each result there, and then fetch the data in the desired order (see the sketch after this list).
2) self.l_first and self.l_second should be cleared after each processed doc, or else you will have duplicates in OUT_LIST.
3) You may use multiprocessing with the subprocess module, put all the result data into CSV files for example, and then sort them as you wish.
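A rough sketch of point 1 (the names job_results and results_lock are illustrative, not from your code):

import threading

job_results = {"first": None, "second": None}
results_lock = threading.Lock()


def store_result(host, text):
    # Called by the Processor threads instead of appending to OUT_LIST.
    with results_lock:
        job_results[host] = text


# After queue.join() and out_queue.join() in main():
# for host in ("first", "second"):
#     print(job_results[host])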