I'm trying to update a row in a database (the asynchronous way) using the multiprocessing module. My code has a simple function, create_member, that inserts some data into a table and then creates a process that may change this data. The problem is that the session passed to async_create_member closes the database connection, and on the next request I get psycopg's error:
(Interface Error) connection already closed
Here's the code:
def create_member(self, data):
    member = self.entity(**data)
    self.session.add(member)
    for name in data:
        setattr(member, name, data[name])
    self.session.commit()
    self.session.close()
    if self.index.is_indexable:
        Process(target=self.async_create_member,
                args=(data, self.session)).start()
    return member
def async_create_member(self, data, session):
    ok, data = self.index.create(data)
    if ok:
        datacopy = data.copy()
        data.clear()
        data['document'] = datacopy['document']
        data['dt_idx'] = datacopy['dt_idx']
        stmt = update(self.entity.__table__).where(
            self.entity.__table__.c.id_doc == datacopy['id_doc'])\
            .values(**data)
        session.begin()
        session.execute(stmt)
        session.commit()
        session.close()
I could possibly solve this by creating a new connection in async_create_member, but that was leaving too many idle transactions in Postgres:
engine = create_new_engine()
conn = engine.connect()
conn.execute(stmt)
conn.close()
What should I do now? Is there a way to fix the first approach? Or should I keep creating new connections with the create_new_engine function? Should I use threads or processes?
You can't reuse sessions across threads or processes. Sessions aren't thread safe, and the connectivity that underlies a Session isn't inherited cleanly across processes. The error message you are getting is accurate, if uninformative: the DB connection is indeed closed if you try to use it after inheriting it across a process boundary.
In most cases, yes, you should create a session for each process in a multiprocessing setting.
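For the original code, that usually means not passing self.session into the Process at all. A minimal sketch of that idea (assuming SQLAlchemy; db_url, NullPool, and the explicit engine.dispose() are my additions, not something from the question):

from multiprocessing import Process
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.pool import NullPool

def async_create_member(data, db_url):
    # Build a fresh engine/session inside the child process instead of
    # inheriting the parent's session.
    engine = create_engine(db_url, poolclass=NullPool)
    session = sessionmaker(bind=engine)()
    try:
        # ... build and execute the UPDATE statement here ...
        session.commit()
    finally:
        session.close()
        engine.dispose()  # release the connection rather than leaving it idle

# In create_member, pass plain data (not the session) to the child:
# Process(target=async_create_member, args=(data, db_url)).start()

Using NullPool (or disposing the engine when the child finishes) keeps the short-lived process from holding idle connections open, which was the complaint about the create_new_engine approach.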
If your problem meets the following conditions:
you are doing a lot of CPU-intensive processing for each object
database writes are relatively lightweight in comparison
you want to use a lot of processes (I do this on 8+ core machines)
It might be worth your while to create a single writer process that owns a session, and pass the objects to that process. Here's how it usually works for me (Note: not meant to be runnable code):
import multiprocessing
from your_database_layer import create_new_session, WhateverType

work = multiprocessing.JoinableQueue()

def writer(commit_every=50):
    global work
    session = create_new_session()
    counter = 0
    while True:
        item = work.get()
        if item is None:
            break
        session.add(item)
        counter += 1
        if counter % commit_every == 0:
            session.commit()
        work.task_done()
    # Last DB writes
    session.commit()
    # Mark the final None in the queue as complete
    work.task_done()
    return

def very_expensive_object_creation(data):
    global work
    very_expensive_object = WhateverType(**data)
    # Perform lots of computation
    work.put(very_expensive_object)
    return

def main():
    writer_process = multiprocessing.Process(target=writer)
    writer_process.start()

    # Create your pool that will feed the queue here, i.e.
    workers = multiprocessing.Pool()

    # Dispatch lots of work to very_expensive_object_creation in parallel here
    workers.map(very_expensive_object_creation, some_iterable_source_here)
    # --or-- in whatever other way floats your boat, such as
    workers.apply_async(very_expensive_object_creation, args=(some_data_1,))
    workers.apply_async(very_expensive_object_creation, args=(some_data_2,))
    # etc.

    # Signal that we won't dispatch any more work
    workers.close()

    # Wait for the creation work to be done
    workers.join()

    # Trigger the exit condition for the writer
    work.put(None)

    # Wait for the queue to be emptied
    work.join()

    return
I'm having a problem with my multiprocessing code, and I'm afraid it's a rather simple fix that I'm just not implementing correctly. I've been researching the things that can cause the problem, but all I'm really finding is people recommending the use of a queue to prevent it, and that doesn't seem to be stopping it (again, I may just be implementing the queue incorrectly). I've been at this a couple of days now and I was hoping I could get some help.
Thanks in advance!
import csv
import multiprocessing as mp
import os
import queue
import sys
import time
import connections
import packages
import profiles
def execute_extract(package, profiles, q):
    # This is the package execution for the extract
    # It fires fine and will print the starting message below
    started_at = time.monotonic()
    print(f"Starting {package.packageName}")
    try:
        oracle_connection = connections.getOracleConnection(profiles['oracle'], 1)
        engine = connections.getSQLConnection(profiles['system'], 1)
        path = os.path.join(os.getcwd(), 'csv_data', package.packageName + '.csv')
        cursor = oracle_connection.cursor()

        if os.path.exists(path):
            os.remove(path)

        f = open(path, 'w')
        chunksize = 100000
        offset = 0
        row_total = 0
        csv_writer = csv.writer(f, delimiter='^', lineterminator='\n')
        # I am having to do some data cleansing. I know this is not the most efficient way to do this, but currently
        # it is what I am limited to
        while True:
            cursor.execute(package.query + f'\r\n OFFSET {offset} ROWS\r\n FETCH NEXT {chunksize} ROWS ONLY')
            test = cursor.fetchone()
            if test is None:
                break
            else:
                while True:
                    row = cursor.fetchone()
                    if row is None:
                        break
                    else:
                        new_row = list(row)
                        new_row.append(package.sourceId[0])
                        new_row.append('')
                        i = 0
                        for item in new_row:
                            if type(item) == float:
                                new_row[i] = int(item)
                            elif type(item) == str:
                                new_row[i] = item.encode('ascii', 'replace')
                            i += 1
                        row = tuple(new_row)
                        csv_writer.writerow(row)
                        row_total += 1
            offset += chunksize
        f.close()

        # I know that execution is at least reaching this point. I can watch the CSV files grow as more and more
        # rows are added to them for all the packages. What I never get is either the success message or error message
        # below, and there are never any entries placed in the tables
        query = f"BULK INSERT {profiles['system'].database.split('_')[0]}_{profiles['system'].database.split('_')[1]}_test_{profiles['system'].database.split('_')[2]}.{package.destTable} FROM \"{path}\" WITH (FIELDTERMINATOR='^', ROWTERMINATOR='\\n');"
        engine.cursor().execute(query)
        engine.commit()

        end_time = time.monotonic() - started_at
        print(
            f"{package.packageName} has completed. Total rows inserted: {row_total}. Total execution time: {end_time} seconds\n")
        os.remove(path)
    except Exception as e:
        print(f'An error has occurred for package {package.packageName}.\r\n {repr(e)}')
    finally:
        # Here is where I am trying to add an item to the queue so the get method in the main def will pick it up and
        # remove it from the queue
        q.put(f'{package.packageName} has completed')
        if oracle_connection:
            oracle_connection.close()
        if engine:
            engine.cursor().close()
            engine.close()
if __name__ == '__main__':
    # Setting mp creation type
    ctx = mp.get_context('spawn')
    q = ctx.Queue()

    # For the ETL I generate a list of class objects that hold relevant information. profs contains a list of
    # connection objects (credentials, connection strings, etc). packages contains the information to run the extract
    # (destination tables, query string, package name for logging, etc)
    profs = profiles.get_conn_vars(sys.argv[1])
    packages = packages.get_etl_packages(profs)

    processes = []

    # I'm trying to track both individual package execution time and overall time so I can get an estimate on rows
    # per second
    start_time = time.monotonic()

    sqlConn = connections.getSQLConnection(profs['system'])
    # Here I'm executing a SQL command to truncate all my staging tables to ensure they are empty and will not
    # generate any key violations
    sqlConn.execute(
        f"USE [{profs['system'].database.split('_')[0]}_{profs['system'].database.split('_')[1]}_test_{profs['system'].database.split('_')[2]}]\r\nExec Sp_msforeachtable @command1='Truncate Table ?', @whereand='and Schema_Id=Schema_id(''my_schema'')'")

    # Here is where I start generating a process per package to try and get all packages to run simultaneously
    for package in packages:
        p = ctx.Process(target=execute_extract, args=(package, profs, q,))
        processes.append(p)
        p.start()

    # Here is my attempt at managing the queue. This is a monstrosity of fixes I've tried to get this to work
    results = []
    while True:
        try:
            result = q.get(False, 0.01)
            results.append(result)
        except queue.Empty:
            pass

        allExited = True
        for t in processes:
            if t.exitcode is None:
                allExited = False
                break
        if allExited & q.empty():
            break

    for p in processes:
        p.join()

    # Closing out the end time and writing the overall execution time in minutes.
    end_time = time.monotonic() - start_time
    print(f'Total execution time of {end_time / 60} minutes.')
I can't be sure why you are experiencing a deadlock (I am not at all convinced it is related to your queue management), but I can say for sure that you can simplify your queue-management logic if you do one of two things:
Method 1
Ensure that your worker function, execute_extract, puts something on the results queue even in the case of an exception (I would recommend putting the Exception object itself). Then the entire loop in your main process that begins with while True: and attempts to get the results can be replaced with:
results = [q.get() for _ in range(len(processes))]
You are guaranteed that there will be a fixed number of messages on the queue equal to the number of processes created.
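For illustration, a sketch of what Method 1 asks of the worker (reusing the names from the question; the body is elided):

def execute_extract(package, profiles, q):
    try:
        # ... extract, write the CSV, run the BULK INSERT ...
        q.put(f'{package.packageName} has completed')
    except Exception as e:
        # Always put exactly one item on the queue, even on failure, so the
        # main process can call q.get() once per process.
        q.put(e)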
Method 2 (even simpler)
Simply reverse the order in which you wait for the subprocesses to complete and the order in which you process the results queue. You don't know how many messages will be on the queue, but you aren't processing it until all the processes have returned, so however many messages are on the queue is all you will ever get. Just retrieve them until the queue is empty:
for p in processes:
    p.join()

results = []
while not q.empty():
    results.append(q.get())
At this point I would normally suggest that you use a multiprocessing pool class such as multiprocessing.Pool which does not require an explicit queue to retrieve results. But make either of these changes (I suggest Method 2, as I cannot see how it can cause a deadlock since only the main process is running at this point) and see if your problem goes away. I am not, however, guaranteeing that your issue is not somewhere else in your code. While your code is overly complicated and inefficient, it is not obviously "wrong." At least you will know whether your problem is elsewhere.
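If you do eventually move to a pool, a rough sketch of what that could look like here (hypothetical; it assumes execute_extract is reworked to take (package, profs) and return its status string instead of putting it on a queue):

def run_all(packages, profs):
    with mp.Pool() as pool:
        # starmap collects each worker's return value, so no explicit Queue
        # or exit-code polling is needed.
        results = pool.starmap(execute_extract,
                               [(package, profs) for package in packages])
    return results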
And my question for you: What does it buy you to do everything using a context acquired with ctx = mp.get_context('spawn') instead of just calling the methods on the multiprocessing module itself? If your platform had support for a fork call, which would be the default context, would you not want to use that?
I am working on a threaded application where one thread will feed a Queue with objects to be modified and a number of other threads will then read from the queue, do the modifications and save the changes.
The application won't need a lot of concurrency, so I would like to stick to an SQLite database. Here is a small example illustrating the application:
import queue
import threading

import peewee as pw

db = pw.SqliteDatabase('test.db', threadlocals=True)


class Container(pw.Model):
    contents = pw.CharField(default="spam")

    class Meta:
        database = db


class FeederThread(threading.Thread):
    def __init__(self, input_queue):
        super().__init__()
        self.q = input_queue

    def run(self):
        containers = Container.select()
        for container in containers:
            self.q.put(container)


class ReaderThread(threading.Thread):
    def __init__(self, input_queue):
        super().__init__()
        self.q = input_queue

    def run(self):
        while True:
            item = self.q.get()
            with db.execution_context() as ctx:
                # Get a new connection to the container object:
                container = Container.get(id=item.id)
                container.contents = "eggs"
                container.save()
            self.q.task_done()


if __name__ == "__main__":
    db.connect()
    try:
        db.create_tables([Container,])
    except pw.OperationalError:
        pass
    else:
        [Container.create() for c in range(42)]
    db.close()

    q = queue.Queue(maxsize=10)

    feeder = FeederThread(q)
    feeder.setDaemon(True)
    feeder.start()

    for i in range(10):
        reader = ReaderThread(q)
        reader.setDaemon(True)
        reader.start()

    q.join()
Based on the peewee docs multi-threading should be supported for SQLite. However, I keep getting the infamous peewee.OperationalError: database is locked error with the error output pointing to the container.save() line.
How do I get around this?
I was kind of surprised to see this failing as well, so I copied your code and played around with some different ideas. I think the problem is that execution_context() by default runs the wrapped block in a transaction. To avoid this, I passed in False in the reader threads.
I also edited the feeder to consume the SELECT statement before putting stuff into the queue (list(Container.select())).
The following works for me locally:
class FeederThread(threading.Thread):
    def __init__(self, input_queue):
        super(FeederThread, self).__init__()
        self.q = input_queue

    def run(self):
        containers = list(Container.select())
        for container in containers:
            self.q.put(container.id)  # I don't like passing model instances around like this, personal preference though


class ReaderThread(threading.Thread):
    def __init__(self, input_queue):
        super(ReaderThread, self).__init__()
        self.q = input_queue

    def run(self):
        while True:
            item = self.q.get()
            with db.execution_context(False):
                # Get a new connection to the container object:
                container = Container.get(id=item)
                container.contents = "nuggets"
                with db.atomic():
                    container.save()
            self.q.task_done()


if __name__ == "__main__":
    with db.execution_context():
        try:
            db.create_tables([Container,])
        except pw.OperationalError:
            pass
        else:
            [Container.create() for c in range(42)]

    # ... same ...
I'm not wholly satisfied with this, but hopefully it gives you some ideas.
Here's a blog post I wrote a while back that has some tips for getting higher concurrency with SQLite: http://charlesleifer.com/blog/sqlite-small-fast-reliable-choose-any-three-/
Have you tried WAL mode?
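If you want to try it, a minimal sketch of turning WAL on (the pragmas argument is how current peewee versions expose it; older versions may need a raw PRAGMA instead):

# Write-ahead logging lets readers proceed while a writer holds the database.
db = pw.SqliteDatabase('test.db', pragmas={'journal_mode': 'wal'})

# Or, on an already-open connection:
db.execute_sql('PRAGMA journal_mode=WAL;')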
Improve INSERT-per-second performance of SQLite?
You have to be quite careful if you have concurrent access to SQLite, as the whole database is locked when writes are done, and although multiple readers are possible, writes will be locked out. This has been improved somewhat with the addition of a WAL in newer SQLite versions.
and
If you are using multiple threads, you can try using the shared page cache, which will allow loaded pages to be shared between threads, which can avoid expensive I/O calls.
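For reference, the shared-cache option looks like this with the standard sqlite3 module (a sketch only; whether it helps depends on your access pattern, and it is separate from peewee's own connection handling):

import sqlite3

# The URI form enables SQLite's shared page cache for this connection.
conn = sqlite3.connect('file:test.db?cache=shared', uri=True)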
Note: I want to implement this without using any framework.
I have to create a web application using Python. The application should maintain a running average of the CPU usage for each process over the past 60 seconds. It should act as a web server, and when it gets a request it should return the current average for each process. Following are the scripts I've written. record_usage.py is a script which I want to run as soon as server.py is run, so that it collects and maintains the CPU usage data, which I intend to read whenever I get an XHR request and send back to the client.
So, my problem is: how do I meet this requirement? I tried running record_usage.py using subprocess.Popen after starting the server, and record_usage.py does start running in the background. But when I try accessing the data created by it, the class object I create is not the one it uses but a new one. How do I complete this link?
Please ask about anything I haven't made clear.
Latest changes in server.py
if __name__ == '__main__':
    RU_OBJ = RU(settings.SAMPLING_FREQ, settings.AVG_INTERVAL)
    RU_LOCK = RLock()

    # Record CPU usage in a thread.
    ru_thread = Thread(target=RU_OBJ.record, args=(RU_LOCK,))
    ru_thread.daemon = True
    ru_thread.start()

    # Run server.
    run()
Latest change in record_usage.py
def record(self, lock):
    while True:
        with lock:
            self.add_processes()
        time.sleep(self.sampling_freq)
Is this a proper way of applying locks? A similar lock is applied when I'm reading the process information. Would it work?
Added the functions:
def add_processes(self):
    for _process in psutil.process_iter():
        try:
            new_proc = _process.as_dict(attrs=['cpu_times', 'name', 'pid',
                                               'status'])
        except psutil.NoSuchProcess:
            continue
        pid, (user, _sys) = new_proc['pid'], new_proc.pop('cpu_times')
        # Get or create details object for the process.
        existing = self.processes.setdefault(pid, new_proc)
        # Get or create queue object for the CPU times of the process.
        queue_dict = self.process_queue.setdefault(pid, dict())
        # User CPU time.
        user_q = queue_dict.setdefault('user_q', PekableQueue(self.avg_interval))
        user_q.enqueue(user)
        user_avg = get_avg(user_q)
        # System CPU time.
        sys_q = queue_dict.setdefault('sys_q', PekableQueue(self.avg_interval))
        sys_q.enqueue(_sys)
        sys_avg = get_avg(sys_q)
        # Update the details object for the process.
        existing.update(user_avg=user_avg, sys_avg=sys_avg, **new_proc)

def get_curr_processes(self):
    return [self.processes[pid] for pid in psutil.get_pid_list()
            if pid in self.processes]
To collect statistics in another thread:
if __name__ == '__main__':
    from threading import Thread, Lock
    import record_usage

    lock = Lock()
    t = Thread(target=record_usage.record, args=[lock])
    t.daemon = True
    t.start()

    run(lock)
If you change some shared data in one thread and read it in another then you could protect the places where you access/change the value with a lock:
#...
with self.lock:
    existing = self.processes.setdefault(pid, new_proc)
#...
with self.lock:
    existing.update(user_avg=user_avg, sys_avg=sys_avg, **new_proc)
#...

def get_curr_processes(self):
    with self.lock:
        return [self.processes[pid] for pid in psutil.get_pid_list()
                if pid in self.processes]
It is essential that self.lock is the same object in all threads. If self.processes is a dict, then you don't need to use a lock in CPython: the methods are implemented in C, and the interpreter doesn't release the GIL (global interpreter lock) while calling them, i.e., only one thread at a time accesses the dict.
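To make "the same object in all threads" concrete, a minimal sketch (the RU class name and constructor arguments are taken from the question; the rest is guesswork): create the lock once in __init__ and have every method use that one attribute.

import threading

class RU:
    def __init__(self, sampling_freq, avg_interval):
        self.lock = threading.Lock()   # one lock, shared by all methods/threads
        self.processes = {}
        self.process_queue = {}
        self.sampling_freq = sampling_freq
        self.avg_interval = avg_interval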
I use Python 2.7 and a SQLite3 database. I want to run update queries on the database that can take some time. On the other hand, I don't want the user to have to wait.
Therefore I want to start a new thread to do the database updating.
Python throws an error. Is there an effective way to tell the database to do the update in its own thread without having to wait for the thread to finish?
line 39, in execute
ProgrammingError: SQLite objects created in a thread can only be used in that same thread.The object was created in thread id 3648 and this is thread id 6444
As far as examples go, I'm trying to write an Anki add-on. The add-on code that produces the error is:
from anki.sched import Scheduler
import threading

def burying(self, card):
    buryingThread = threading.Thread(target=self._burySiblings, args=(card,))
    buryingThread.start()

def newGetCard(self):
    "Pop the next card from the queue. None if finished."
    self._checkDay()
    if not self._haveQueues:
        self.reset()
    card = self._getCard()
    if card:
        burying(self, card)
        self.reps += 1
        card.startTimer()
        return card

__oldFunc = Scheduler.getCard
Scheduler.getCard = newGetCard
Check out the Celery project; if you're using Django it will be even more straightforward: www.celeryproject.org
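Aside from Celery, it may help to know why the error appears at all: an sqlite3 connection can only be used in the thread that created it, so the usual workaround is to open the connection inside the worker thread. A minimal sketch with plain sqlite3 (the path and query are made up; Anki's own collection object adds its own constraints on top of this):

import sqlite3
import threading

def slow_update(db_path, query, params):
    conn = sqlite3.connect(db_path)   # created inside the thread that uses it
    try:
        conn.execute(query, params)
        conn.commit()
    finally:
        conn.close()

t = threading.Thread(target=slow_update,
                     args=('collection.db', 'UPDATE cards SET due = ? WHERE id = ?', (0, 1)))
t.start()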
I've got an event-driven chatbot and I'm trying to implement spam protection. I want to silence a user who is behaving badly for a period of time, without blocking the rest of the application.
Here's what doesn't work:
if user_behaving_badly():
    ban( user )
    time.sleep( penalty_duration ) # Bad! Blocks the entire application!
    unban( user )
Ideally, if user_behaving_badly() is true, I want to start a new thread which does nothing but ban the user, then sleep for a while, unban the user, and then the thread disappears.
According to this I can accomplish my goal using the following:
if user_behaving_badly():
    thread.start_new_thread( banSleepUnban, ( user, penalty ) )
"Simple" is usually an indicator of "good", and this is pretty simple, but everything I've heard about threads has said that they can bite you in unexpected ways. My question is: Is there a better way than this to run a simple delay loop without blocking the rest of the application?
Instead of starting a thread for each ban, put the bans in a priority queue and have a single thread do the sleeping and unbanning.
This code keeps two structures: a heap that lets it quickly find the soonest ban to expire, and a dict that makes it possible to quickly check by name whether a user is banned.
import time
import threading
import heapq

class Bans():
    def __init__(self):
        self.lock = threading.Lock()
        self.event = threading.Event()
        self.heap = []
        self.dict = {}
        self.thread = threading.Thread(target=self.expiration_thread)
        self.thread.setDaemon(True)
        self.thread.start()

    def ban_user(self, user, duration):
        with self.lock:
            now = time.time()
            expiration = (now + duration)
            heapq.heappush(self.heap, (expiration, user))
            self.dict[user] = expiration
            self.event.set()

    def is_user_banned(self, user):
        with self.lock:
            now = time.time()
            expiration = self.dict.get(user)
            return expiration is not None and expiration > now

    def expiration_thread(self):
        while True:
            self.event.wait()
            with self.lock:
                expiration, user = self.heap[0]
                now = time.time()
                duration = expiration - now
            if duration > 0:
                time.sleep(duration)
            with self.lock:
                if self.heap[0][0] == expiration:
                    heapq.heappop(self.heap)
                    del self.dict[user]
                if not self.heap:
                    self.event.clear()
and is used like this:
B = Bans()
B.ban_user("phil", 30.0)
B.is_user_banned("phil")
Use a threading.Timer object, like this:
t = threading.Timer(30.0, unban, args=(user,))
t.start()  # after 30 seconds, unban(user) will be run
Then only unban is run in the thread.
Why thread at all?
def do_something(user):
    if good_user(user):
        pass  # do it
    else:
        pass  # don't

def good_user(user):
    if is_user_banned(user):
        if past_time_since_ban(user):
            unban_user(user)  # the ban has expired, so treat the user as good again
            return True
        return False
    elif is_user_bad(user):
        ban_user(user)
        return False
    return True

def ban_user(user):
    # add a user/start time to a hash
    ...

def is_user_banned(user):
    # check the hash
    # could check if it has expired now too, or do it separately if you care about that
    ...

def is_user_bad(user):
    # check params or set more values in the hash
    ...
This is language agnostic, but consider a thread to keep track of things. The thread keeps a data structure that has something like "username" and "banned_until" in a table. The thread is always running in the background checking the table; if banned_until has expired, it unbans the user. Other threads go on normally.
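A rough sketch of that polling checker in Python (the names and the one-second interval are placeholders):

import time
import threading

bans = {}                     # username -> banned_until (epoch seconds)
bans_lock = threading.Lock()

def ban_checker(poll_interval=1.0):
    while True:
        now = time.time()
        with bans_lock:
            expired = [name for name, until in bans.items() if until <= now]
            for name in expired:
                del bans[name]   # ban has expired; unblock the user
        time.sleep(poll_interval)

threading.Thread(target=ban_checker, daemon=True).start()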
If you're using a GUI, most GUI modules have a timer function which can abstract away all the yucky multithreading stuff and execute code after a given time, while still allowing the rest of the code to run.
For instance, Tkinter has the 'after' function.
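A minimal Tkinter sketch of that (ban, unban, user, and penalty_duration being the placeholders from the question): after() schedules the unban on the GUI event loop without blocking it.

import tkinter as tk

root = tk.Tk()

if user_behaving_badly():
    ban(user)
    # after() takes milliseconds and passes extra arguments to the callback.
    root.after(int(penalty_duration * 1000), unban, user)

root.mainloop()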